Some Aspects of Nonresponse Adjustments


R. PLATEK and G.B. GRAY


ABSTRACT 

Unit and item nonresponse almost always occur in surveys and censuses. The larger the
nonresponse, the larger its potential effect on survey estimates. It is, therefore, important
to cope with it at every stage where it can be affected. To varying degrees, the size of
nonresponse can be controlled at the design, field and processing stages. Nonresponse has an
impact on the estimation formulas for various statistics because imputations and weight
adjustments enter, along with survey weights, into the estimates of means, totals, or other
statistics. The formulas may be decomposed into components that include response errors, the
effect of weight adjustment for unit nonresponse, and the effect of substitution for
nonresponse. The impacts of the design, field, and processing stages on the components of the
estimates are examined.


KEY WORDS: Nonresponse; Imputation; Estimation.


1. INTRODUCTION


As survey data are gathered from sampled units, unit and item nonresponse will occur for
at least some units despite all efforts to avoid it. The problem of dealing with nonresponse
and the resultant missing data is two-fold. First, the amount of effort through callbacks,
repeated mailings, etc. must be determined so that it remains cost-effective in reducing the
mean square error of survey data. Second, for the remaining nonresponse, adjustments for the
missing data must be obtained in order to reduce the nonresponse bias.

The field or survey centre effort to reduce or minimize unit nonresponse often means repeated 
attempts to contact selected units until a responsible person is available to reply to the survey 
questionnaire. The attempts pertain either to personal or telephone interviews. In the case of
mail surveys, repeated attempts mean successive mailings of a survey questionnaire to nonrespon- 
ding units. In some cases, the repeated attempts may result in telephone or personal follow- 
ups. Some nonresponse is inevitable although every reasonable attempt should be made to 
minimize its levels. Thus, there will always remain some nonrespondents for whom all the ef- 
forts to convert them seem insufficient or inappropriate. The result is some imputation pro- 
cedure to account for the missing data. This paper addresses the problems of controlling 
nonresponse at the design and field stage, followed by an examination of nonresponse ad- 
justments at the processing stage. The examination will consider the feasibility and the prac- 
tical as well as the methodological issues pertaining to the nonresponse adjustments. 

Item nonresponse is often a more complex problem to deal with than unit nonresponse which 
is the type mostly referred to above. The most important factors which may reduce item 
nonresponse are good questionnaire design and a high quality of interviewers through proper 
hiring and training. A poorly designed questionnaire may also result in problems of following 
or completing the proper sequence of questions, whether by an interviewer or in a self-interview 
situation. Consequently, item nonresponse may occur in a questionnaire without the interviewer 
or respondent being aware of it. In addition, respondents may be willing to answer some but 
not all questions in a survey. Whatever the reason for missing items, the problem of substituting
for them remains. Usually, a survey organization is unwilling to throw out whatever information
has been obtained unless, of course, the responses to major items appear very faulty or illogical.
Thus, other means of imputing for missing items while maintaining the partial information
on the records are usually undertaken.

Various statistics are required from a survey or census to explain social phenomena, deter- 
mine socio-economic policies, etc. These include means, totals, ratios, distributions, percen- 
tiles and graphs. The statistics are assumed to be based on a universe of N units that belong 
to the target population; where N may or may not be known. 

It may be demonstrated that all of the statistics mentioned above may be expressed in 
terms of totals or counts. Consequently, the remainder of the article will deal with missing 
data as they affect estimates of totals and counts in surveys. Some references to censuses 
will also be made. 


2. ESTIMATION FORMULA 


In the presence of unit and item nonresponse, the estimate of the total of characteristic
$y$ may be given by the general expression (2.1) below.

$$\hat{Y} = \sum_{i=1}^{N} t_i \pi_i^{-1} \left\{ \delta_i \left[ \delta_{iy} y_i + (1 - \delta_{iy}) z_{iy} \right] + (1 - \delta_i) z_i \right\}, \tag{2.1}$$

where

$t_i$ = 1 or 0 according as unit $i$ is selected or not,

$\pi_i$ = probability that unit $i$ is selected,

$\delta_i$ = 1 or 0 according as unit $i$ responds or not,

$\delta_{iy}$ = 1 or 0 according as responding unit $i$ responds to item or characteristic $y$ or not,

$y_i$ = observed response for characteristic $y$ when $\delta_{iy} = \delta_i = 1$; $y_i$ may or may not equal $Y_i$, the true value,

$z_{iy}$ = imputed value for item nonresponse, when $\delta_i = 1$, $\delta_{iy} = 0$,

$z_i$ = imputed value for unit nonresponse, when $\delta_i = 0$.


The above estimate may pertain to a class $a$ of units when one inserts the indicator variable
$\delta_{ia}$, equal to 1 or 0, after $\pi_i^{-1}$ to indicate whether or not unit $i$ belongs to class $a$
(e.g., age-sex class $a$).

In the case of item nonresponse, $z_{iy}$ is nearly always an explicit imputed value for the
missing information. The imputed value may be obtained by (i) a hot deck procedure, i.e.,
substitution of an available response of characteristic $y$ from the survey questionnaire of
another unit that responded with respect to the characteristic and that is as similar as
possible to unit $i$ according to a decision table, (ii) substitution from other sources of data
for the same unit such as an earlier survey, census, or administrative data if such data are
available, (iii) regression methods or (iv) logical deduction; the list is by no means
exhaustive. In some cases, systematic errors may occur from, for example, faulty coders or
keypunchers. In such cases one attempts to change the codes to logical values relative to
other information on the questionnaire in place of imputation. In any case, one hopes to
achieve an imputed value or altered code as close to the true value $Y_i$ as possible. In the
case of continuous surveys, with characteristics that are stable over a long period of time
(such as employment in some industries and occupations), the response from earlier survey data
may be considered almost as good as that of current survey data for the same unit. This
would be especially so when the reference periods of the current and earlier survey data are
not too far apart in time. This may also be true of survey data one year apart
in the case of seasonal characteristics such as, for example, those related to the fishing
industry. Sometimes the imputation of earlier survey data may also be used for unit
nonrespondents that were respondents previously and have stable characteristics.

Usually, in the case of unit nonresponse, the imputation is undertaken by weight adjustment
by the inverse response rate in a cell or area. The estimate of total is then given by:

$$\hat{Y} = \sum_{i=1}^{N} t_i \pi_i^{-1} (wa)_i \delta_i \left[ \delta_{iy} y_i + (1 - \delta_{iy}) z_{iy} \right], \tag{2.2}$$

where $(wa)_i$ = weight adjustment for unit $i$ to compensate for the deficient sample due to
unit nonresponse. In the above expression, it is assumed that all item nonresponse has already
been imputed for by $z_{iy}$ in the case of responding unit $i$ when $\delta_{iy} = 0$.
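
As an illustration only, the two estimates (2.1) and (2.2) can be computed directly from unit-level arrays. The following Python sketch assumes NumPy, restricts the sums to the sampled units (those with $t_i = 1$), and uses made-up data; the function names, the data, and the placeholder value 0 for unobserved entries are assumptions introduced here, not part of the original formulation.

```python
import numpy as np

def total_with_imputation(pi, delta, delta_y, y, z_item, z_unit):
    """Estimate (2.1): explicit imputation for item and unit nonresponse.

    All arrays cover the sampled units only (t_i = 1). Unobserved entries
    of y, z_item and z_unit may hold any finite placeholder (0 here),
    because the indicators multiply them out.
    """
    w = 1.0 / pi  # sample weights pi_i^{-1}
    kept = delta * (delta_y * y + (1 - delta_y) * z_item)
    return float(np.sum(w * (kept + (1 - delta) * z_unit)))

def total_with_weight_adjustment(pi, delta, delta_y, y, z_item, wa):
    """Estimate (2.2): unit nonresponse handled by the weight adjustment wa."""
    w = 1.0 / pi
    kept = delta * (delta_y * y + (1 - delta_y) * z_item)
    return float(np.sum(w * wa * kept))

# Hypothetical sample of five units; unit 4 is a unit nonrespondent and
# unit 2 responded to the survey but not to item y.
pi = np.array([0.1, 0.1, 0.2, 0.2, 0.1])
delta = np.array([1.0, 1.0, 1.0, 0.0, 1.0])
delta_y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
y = np.array([10.0, 0.0, 20.0, 0.0, 30.0])
z_item = np.array([0.0, 12.0, 0.0, 0.0, 0.0])
z_unit = np.array([0.0, 0.0, 0.0, 15.0, 0.0])

est_21 = total_with_imputation(pi, delta, delta_y, y, z_item, z_unit)
est_22 = total_with_weight_adjustment(pi, delta, delta_y, y, z_item, wa=5 / 4)
print(est_21, est_22)
```

Here the weight adjustment 5/4 is the inverse of the unweighted response rate (four respondents out of five sampled units), treating the whole sample as a single adjustment cell.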

The estimates of the cumulative distribution function from the sample in the context of
potential missing data may be obtained by replacing the observed value $y_i$ by the indicator
variable $c(y_i, Y)$ = 1 or 0 according as $y_i \le$ or $> Y$, and similarly for $z_{iy}$ and $z_i$. The
estimated c.d.f.'s corresponding to (2.1) and (2.2) are respectively given by (2.3) and (2.4)
below.

$$\hat{F}(Y) = \frac{1}{\hat{N}} \sum_{i=1}^{N} t_i \pi_i^{-1} \left\{ \delta_i \left[ \delta_{iy} c(y_i, Y) + (1 - \delta_{iy}) c(z_{iy}, Y) \right] + (1 - \delta_i) c(z_i, Y) \right\}, \tag{2.3}$$

where $\hat{N} = \sum_{i=1}^{N} t_i \pi_i^{-1}$ denotes the estimated or the true count of units in the universe.
Thus, depending upon the frame, sample design, and listings of units, $\hat{N}$ may or may not equal $N$.

$$\hat{F}(Y) = \frac{1}{\hat{N}} \sum_{i=1}^{N} t_i \pi_i^{-1} (wa)_i \delta_i \left[ \delta_{iy} c(y_i, Y) + (1 - \delta_{iy}) c(z_{iy}, Y) \right]. \tag{2.4}$$


While $\hat{Y}$, as defined in (2.1) and (2.2), is the same whether imputation
for unit nonresponse is regarded as a substitution of mean values of respondents or as a
weight adjustment, the c.d.f. estimates $\hat{F}(Y)$, as defined in (2.3) and (2.4), are not identical.
When the mean of respondents, either overall or in adjustment cells defined for compensation
of nonresponse, is substituted for each missing value as in (2.1) or (2.3), there results
a spiking of such mean values in the estimated c.d.f., not reflecting the real shape of the
c.d.f. in the population. The use of the weight adjustment $(wa)_i$ to inflate the sample
weight $\pi_i^{-1}$ in (2.4) avoids this spiking effect, yielding a different but more realistic estimate
of the c.d.f.
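
The spiking effect can be seen in a small numerical sketch. The Python code below is illustrative only: it assumes NumPy, equal selection probabilities, a single adjustment cell, and no item nonresponse (so $\delta_{iy} = 1$ for every respondent); the data and function names are hypothetical.

```python
import numpy as np

def cdf_mean_imputation(Y, pi, delta, y, z_unit):
    """c.d.f. estimate in the spirit of (2.3), without item nonresponse."""
    w = 1.0 / pi
    values = np.where(delta == 1, y, z_unit)  # observed or imputed value
    return float(np.sum(w * (values <= Y)) / np.sum(w))

def cdf_weight_adjusted(Y, pi, delta, y, wa):
    """c.d.f. estimate in the spirit of (2.4), without item nonresponse."""
    w = 1.0 / pi
    return float(np.sum(w * wa * delta * (y <= Y)) / np.sum(w))

# Six sampled units, four respondents with values 1, 2, 3, 4.
pi = np.full(6, 0.5)
delta = np.array([1, 1, 1, 1, 0, 0])
y = np.array([1.0, 2.0, 3.0, 4.0, 0.0, 0.0])  # 0 is a placeholder
respondent_mean = y[delta == 1].mean()        # 2.5
z_unit = np.full(6, respondent_mean)
wa = len(pi) / delta.sum()                    # inverse response rate, 1.5

for Y in (2.4, 2.5, 3.0):
    print(Y,
          cdf_mean_imputation(Y, pi, delta, y, z_unit),
          cdf_weight_adjusted(Y, pi, delta, y, wa))
```

With mean imputation the estimated c.d.f. jumps from 1/3 to 2/3 at the respondent mean 2.5, a value that no respondent reported, while the weight-adjusted estimate stays at 0.5 there and changes only at observed values.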

Under full unit and item response, the estimates (2.1) and (2.2) simplify to the Horvitz-
Thompson (1952) estimate of the total, which is unbiased apart from response errors. In
the presence of missing data and imputation for them, the estimates (2.1) and (2.2), however,
are likely to be biased for reasons other than response errors unless the $z_{iy}$'s and $z_i$'s tend to
equal the $y_i$'s when imputation for either item or unit nonresponse is required.

In the next section, the estimates (2.1) and (2.2) are decomposed into various components
due to response error, imputation error due to item nonresponse, imputation error due to
unit nonresponse and the effect of weight adjustments exceeding one.


3. COMPONENTS OF THE ESTIMATE


The estimate $\hat{Y}$ given by (2.1) or (2.2) may be split up into 5 components, beginning with
the Horvitz-Thompson estimate using the true values of the characteristic, as in Table 1. The
estimated c.d.f. $\hat{F}(Y)$ as in (2.4) may be similarly split up, but this will be omitted in this paper.

When the weight adjustment $(wa)_i = 1$, the last line cancels out and the first 4 lines (3.1)
to (3.4) total the estimate as given by (2.1). When the unit nonresponse is compensated for
by a weight adjustment $(wa)_i > 1$, there is no direct substitution $z_i$ for the missing value
and $z_i$ is taken to be 0 in (3.4). In that case, the 5 lines total the estimate as given by (2.2)
and the negative effect of unit nonresponse in (3.4) is compensated for by the positive effect
of weight adjustment in (3.5).


Table 1
Components of the Estimate $\hat{Y}$

$$\hat{Y} = \sum_{i=1}^{N} t_i \pi_i^{-1} Y_i \tag{3.1}$$
(unbiased estimate based on full response, with true values)

$$+ \sum_{i=1}^{N} t_i \pi_i^{-1} (y_i - Y_i) \tag{3.2}$$
(effect of response error)

$$+ \sum_{i=1}^{N} t_i \pi_i^{-1} \delta_i (1 - \delta_{iy}) (z_{iy} - y_i) \tag{3.3}$$
(effect of item nonresponse)

$$+ \sum_{i=1}^{N} t_i \pi_i^{-1} (1 - \delta_i) (z_i - y_i) \tag{3.4}$$
(effect of unit nonresponse)

$$+ \sum_{i=1}^{N} t_i \pi_i^{-1} \left[ (wa)_i - 1 \right] \delta_i \left[ \delta_{iy} y_i + (1 - \delta_{iy}) z_{iy} \right] \tag{3.5}$$
(effect of weight adjustment for unit nonresponse)
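
The decomposition can be checked numerically. The Python sketch below is an illustration under stated assumptions: it uses NumPy, simulated data with arbitrary distributions, a single adjustment cell with the unweighted inverse response rate as weight adjustment, and a hypothetical helper named `components`. It treats $y_i$ as the response a unit would give, so that it is defined for nonrespondents as well.

```python
import numpy as np

def components(pi, delta, delta_y, Y_true, y, z_item, z_unit, wa):
    """Return the five components (3.1) to (3.5) over the sampled units."""
    w = 1.0 / pi
    kept = delta * (delta_y * y + (1 - delta_y) * z_item)
    return np.array([
        np.sum(w * Y_true),                                # (3.1)
        np.sum(w * (y - Y_true)),                          # (3.2)
        np.sum(w * delta * (1 - delta_y) * (z_item - y)),  # (3.3)
        np.sum(w * (1 - delta) * (z_unit - y)),            # (3.4)
        np.sum(w * (wa - 1) * kept),                       # (3.5)
    ])

# Simulated sample; every distribution below is an arbitrary choice.
rng = np.random.default_rng(0)
n = 50
pi = rng.uniform(0.05, 0.5, n)
delta = (rng.random(n) < 0.8).astype(float)    # unit response indicator
delta_y = (rng.random(n) < 0.9).astype(float)  # item response indicator
Y_true = rng.normal(100.0, 20.0, n)
y = Y_true + rng.normal(0.0, 5.0, n)           # response with error
z_item = rng.normal(100.0, 20.0, n)
z_unit = rng.normal(100.0, 20.0, n)

w = 1.0 / pi
kept = delta * (delta_y * y + (1 - delta_y) * z_item)

# Case 1: explicit substitution z_i with (wa)_i = 1 reproduces (2.1).
est_21 = np.sum(w * (kept + (1 - delta) * z_unit))
parts_21 = components(pi, delta, delta_y, Y_true, y, z_item, z_unit, wa=1.0)

# Case 2: weight adjustment with z_i = 0 in (3.4) reproduces (2.2).
wa = n / delta.sum()
est_22 = np.sum(w * wa * kept)
parts_22 = components(pi, delta, delta_y, Y_true, y, z_item, np.zeros(n), wa=wa)

print(np.isclose(parts_21.sum(), est_21), np.isclose(parts_22.sum(), est_22))
```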


(a) Response error 


The sum of the 1st and 2nd lines of the estimate $\hat{Y}$ (see 3.1 and 3.2) equals the desired
Horvitz-Thompson estimate of total under full response. The observed response $y_i$ for unit
$i$ may not equal the true value $Y_i$, so that a response error at the unit $i$ level may result. The
response error, which is not the real subject of this paper, can only be reduced, though not
likely eliminated, by proper interviewer training and good questionnaire design with unambiguous
definitions of characteristics and questions and without clutter that would confuse the
interviewer and/or respondent.

When the sample weighted response errors of (3.2) do not cancel out, the estimate of
the total $Y$ under full response contains response error. Upon taking expected values over
all possible samples and over responses (see Platek and Gray 1983), it may be found
to be subject to response bias and response variance in addition to sampling variance
(SV). The response variance may be decomposed into simple (SRV) and correlated response
variance (CRV) components.

The response bias and all of the variance components (SV), (SRV) and (CRV) for the
above estimate are derived in Platek and Gray (1983), subsection 2.2, pp. 257-8.

Response errors are usually studied by means of a reconciled reinterview program, whereby
a subsample of responding units is reinterviewed and any observed differences between the
original and reinterview data pertaining to the sample reference period are reconciled to
determine which of the original or reinterview responses is correct. Reconciled reinterview surveys
are undertaken in both the Canadian Labour Force Survey (LFS) and the U.S. Current Population
Survey (CPS), two similar monthly surveys to measure unemployment, employment, etc.
For example, Poterba and Summers (1984) present some CPS results, given in Table 2, for a
reconciled Reinterview Survey of May 1976, based on a subsample of 3,329 men and 3,750
women. By means of reconciliation of a reinterviewed subsample, the true status of an
individual is obtained so that it can be determined whether or not that individual responded
correctly in the original survey, which in this case is CPS. Thus, the number of
individuals with the true characteristic Employed in the reconciled reinterview sample who were
actually reported as Employed, Unemployed, or Not in the LF in the original survey may
be determined. From the three numbers, the proportion (or the probability) of correct and
incorrect responses by true LF status may be estimated as in the table below.

Thus, for all of the men who were actually unemployed, 0.8720 is the estimated proportion,
according to the reconciled reinterview study, who were accurately reported
as unemployed, while (0.0474 + 0.0806) or 0.1280 of the unemployed men were incorrectly
reported as either Employed or Not in the Labour Force. Thus, if $y$ denotes the characteristic
unemployed, i.e., $Y_i = 1$ when individual no. $i$ is actually unemployed and a male, then
$y_i = 1$, correctly, with probability 0.8720 while $y_i = 0$, incorrectly, with probability 0.1280.

Table 2
Probabilities of Reporting Labour Force Status as Employed,
Unemployed, or NILF in the Regular CPS, by True Status as
Determined by the Reinterview Survey, May 1976.

                   Status as Reported in the Regular CPS
True Status        Employed     Unemployed     NILF

Total¹
  Employed         0.9905       0.0016         0.0079
  Unemployed       0.0356       0.8602         0.1041
  NILF             0.0053       0.0025         0.9923
Men²
  Employed         0.9922       0.0013         0.0065
  Unemployed       0.0474       0.8720         0.0806
  NILF             0.0062       0.0048         0.9890
Women³
  Employed         0.9892       0.0019         0.0089
  Unemployed       0.0194       0.8442         0.1363
  NILF             0.0049       0.0015         0.9936

¹ Sample size = 7,079
² Sample size = 3,329
³ Sample size = 3,750

Source: Tables were computed from "General Labour Force Status in the CPS Reinterview
by Labour Force Status in the Original Interview. Both Sexes. Total. After Reconciliation.
May 1976", Bureau of the Census (unpublished).


In the Canadian Labour Force Survey, the reconciled reinterview study sample during
Jan.-Nov. 1984 covered 7,148 individuals, and the corresponding probabilities of reporting
labour force status as employed, unemployed or NILF in the regular LFS by true status as
determined by the reinterview during 1984 are given in Table 3 below.


Table 3
Number of Individuals and Probabilities of Reporting LF Status
(in brackets) by True Characteristic, Jan.-Nov. 1984

True LF Characteristic                     Regular LFS
(Reconciled reinterview)    Employed    Unemployed    NILF        Total

Employed                    4,082       19            51          4,152
                            (0.9831)    (0.0046)      (0.0123)

Unemployed                  8           571           78          657
                            (0.0122)    (0.8691)      (0.1187)

NILF                        28          30            2,281       2,339
                            (0.0120)    (0.0128)      (0.9752)

Total                       4,118       620           2,410       7,148


Thus the probability of correctly labelling an individual as unemployed, given that he/she
is actually unemployed, is estimated to be .8691 in LFS compared with .8602 in CPS, almost
the same. The corresponding probabilities for Employed and Not in the Labour Force in
LFS are estimated during 1984 to be .9831 and .9752 compared with .9905 and .9923 for
CPS, both somewhat lower in LFS. The reason for the difference cannot be determined at
this stage. In any case, the response errors are likely more serious at national than at small
area levels. For example, at national levels the response biases may be large in magnitude
relative to the sampling errors, while a small area level estimate may be subject to response
biases of about the same percent as at national level, but which may be much smaller than
the sampling errors.
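
The bracketed probabilities for the LFS are simply row proportions of the reconciled reinterview counts. As a minimal sketch, assuming NumPy and the counts of Table 3 (rows are true status, columns are status reported in the regular LFS), they can be recomputed as follows; the variable names are illustrative.

```python
import numpy as np

labels = ["Employed", "Unemployed", "NILF"]

# Rows: true status from the reconciled reinterview.
# Columns: status reported in the regular LFS (same order as labels).
counts = np.array([
    [4082, 19, 51],
    [8, 571, 78],
    [28, 30, 2281],
])

# Conditional probability of each reported status given the true status.
probs = counts / counts.sum(axis=1, keepdims=True)

for label, row in zip(labels, probs):
    print(label, np.round(row, 4))
```

The diagonal entries are the estimated probabilities of a correct report; one minus a diagonal entry is the estimated probability of misclassification for that true status.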


(b) Item Nonresponse and Imputation Error 


The third line (3.3) of the estimate $\hat{Y}$ in Table 1 shows the deviation from the desired estimate
as a result of imputation for item nonresponse when the imputed value $z_{iy} \ne y_i$ and when
the sample weighted differences $(z_{iy} - y_i)$ over the sampled units with imputations for item
nonresponse do not cancel out. Item nonresponse results either from a respondent refusing to answer
certain questions, or from questions on the questionnaire having been inadvertently left incomplete
by either the respondent (in the case of self-enumeration) or the interviewer. The second of the
two causes of item nonresponse may result from causes similar to those for response errors, i.e.,
complex questions with ambiguous definitions and/or an involved or cluttered questionnaire
with a tendency for potential errors in following the proper path, depending upon replies
to filter questions.

When item nonresponse does occur, an imputation strategy as described earlier may be
undertaken, which almost always results in an explicit substitution. Crucial to data analysis
at micro-levels is the need to obtain a value $z_{iy}$ as close as possible to the true value $Y_i$, or at
least as close as possible to what would be the observed $y_i$ if the unit had responded to the
question(s) that determine(s) characteristic $y$. There is unfortunately no way of knowing how
closely $z_{iy}$ agrees with $y_i$ except through re-enumeration of the unit, or a review and study of
external sources or earlier survey data (which may not be available). A further danger of item
nonresponse and the imputation for it may be the false sense of security given to the data user,
who may not be aware or may not be informed of the substituted value $z_{iy}$ in place of a bona fide
response at the micro-data level. The imputed value $z_{iy}$ will tend to deviate in either
direction from the true value $Y_i$ to a greater extent than the potential response error in $y_i$ if that
unit responds to the characteristic. This may not always be the case. Unfortunately, it usually
cannot be determined at the micro-level whether or not $z_{iy}$ is less accurate than $y_i$ would
be. Even if the imputation error may sometimes be lower than the potential response error,
it may further deteriorate the quality of the published statistics because of the presence of
additional variance components.

Item nonresponse and response errors are often detected in the LFS by a monthly project,
the Field Edit Module, which analyzes questionnaires that failed edit for one or more questions.
The distinction between response errors and item nonresponse, however, is often quite blurred
in the analysis without probing into the individual questionnaires in detail. The common
type of discrepancy is a miscoding of a question rather than item nonresponse per se. Many
questions are split up into 5 or 6 different sub-categories and a miscoding may be interpreted
as an item nonresponse for one sub-category and a response error for another sub-category
pertaining to the same question. The analysis of the Field Edit Module deals with items
(questions) but not sub-categories of the questions. The item discrepancy rate is thus difficult to
define unambiguously. It pertains to a subset of questionnaires for which a specific question,
say, No. $q$, is relevant according to filter questions and decision tables. Let us suppose
that out of a responding sample size of $m$ questionnaires, question No. $q$ is relevant for
$m_q \le m$ questionnaires. Then the discrepancy rate is the proportion of the $m_q$ questionnaires
that failed edit, whether by item nonresponse or faulty coding. The ambiguity in the
definition lies in whether the subset $m_q$ should include those questionnaires with the question
completed in error, those with the question left blank in error, or merely those questionnaires
with the question coded correctly or incorrectly. Notwithstanding the possible ambiguity in
the definition, the item discrepancy rates for about 50 items as analysed for calendar year
1984 should indicate an upper bound to the fractional error in the estimates of statistics
based on the items. A sample of discrepancy rates for 1984, by item (items are defined in
Table 4a), is given in Table 4 below.
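
As a small illustration of the definition, the discrepancy rate for question No. $q$ is the number of relevant questionnaires that failed edit divided by $m_q$. The Python sketch below uses hypothetical counts and a hypothetical function name; how $m_q$ is delimited remains subject to the ambiguity noted above.

```python
def discrepancy_rate(n_failed_edit, n_relevant):
    """Proportion of the m_q relevant questionnaires that failed edit.

    n_relevant plays the role of m_q; n_failed_edit counts failures from
    either item nonresponse or faulty coding.
    """
    if n_relevant <= 0:
        raise ValueError("the question must be relevant for at least one questionnaire")
    if not 0 <= n_failed_edit <= n_relevant:
        raise ValueError("failed-edit count must lie between 0 and n_relevant")
    return n_failed_edit / n_relevant

# Hypothetical month: the question is relevant for 600 questionnaires
# and 12 of them failed edit.
rate = discrepancy_rate(n_failed_edit=12, n_relevant=600)
print(f"{rate:.1%}")
```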

Thus, for a straightforward item like (10), "Did the respondent do any work last week?
Yes or No," the discrepancy rate is only 0.2%, much lower than even the national standard
error. For more complex items like Nos. 12, 36, 41, 54 and 77, the discrepancy rate averages
more than 10%, with ranges of 2 to 6% in either direction from the mean over the year. The
discrepancies are corrected for by hot deck procedures, use of the last survey's responses (if
available) or by logical deduction from other questionnaire data. Thus, in many instances
an item discrepancy may be altered to a response subject to response rather than imputation
error, so that the discrepancy rates should be construed as an upper bound to the overall
imputation error rates for the items.


(c) Unit Nonresponse and Weight Adjustment 


In the case of unit nonresponse, the two components of $\hat{Y}$ given by (3.4) and (3.5) must
be studied together since unit nonresponse is generally compensated for by a weight adjustment
$(wa)_i$ rather than direct substitution $z_i$ for a missing unit value. Weight adjustments
are usually calculated by inverse response rates in adjustment cells, of which there are two basic
types, balancing areas and weighting classes. Balancing areas are frequently design-dependent
geographic areas such as a stratum, primary sampling unit, cluster, a group of strata,
or even the entire sample. Weighting classes are defined by post-strata (strata defined after
sampling) formed on the basis of information available for both respondents and
nonrespondents in the sample. The nonrespondents' information may be obtained from partial
nonrespondents with some known characteristics even though the particular characteristic
being estimated is not known for the partial nonrespondents. Alternatively, the information
may be derived from external sources pertaining to the nonrespondents. Inverse response
rates may be calculated for either balancing areas or weighting classes and used as weight
adjustments to compensate for missing data in the cells.
Table 4
Average Discrepancy Rate by Item (defined in Table 4a)

Item     Average Discrepancy Rate    Range of Rates in 1984 (Min. to Max.)

(10)      0.2%                       0.2% every month
(12)     12.3%                       10.4% to 14.3%
(14)      6.7%                       5.7% to 8.4%
(16)      0.4%                       0.3% to 0.5%
(17)      6.6%                       2.0% to 9.9%
(30)      0.4%                       0.3% to 0.5%
(32)      7.0%                       3.0% to 11.6%
(33)      4.3%                       1.8% to 6.0%
(36)     10.6%                       8.1% to 12.7%
(40)      4.1%                       1.5% to 6.8%
(41)     12.1%                       6.2% to 19.7%
(54)     10.1%                       7.9% to 12.1%
(76)     <0.1%                       0.0% to 0.1%
(77)     15.0%                       11.8% to 17.3%

Source: Internal report by Karen Switzer to P.D. Ghangurde, March 4, 1985, "Some Findings on the
Field Edit Module (FEM) Reports from 1984".


Table 4a
Definition of Items

(10) Last week did (respondent) do any work at a job or business? Yes or No.

(12) If yes to 11, "Did ... have more than one job last week", was this a result of changing
employers? Yes or No.

(14) What is the reason ... usually works less than 30 hours per week, if actual response to (13),
no. of hrs. worked, is less than 30.

(16) Last week, how many hours was ... away from work for any reason whatsoever (holidays,
vacations, illness, labour dispute, etc.) "00" should be filled in

(17) What was the main reason for being away from work? (10 possible codes)

(30) Last week did ... have a job or business at which he/she did not work? Yes or No.

(32) Counting from the end of last week, in how many weeks will ... start to work at his/her
new job? (Reply to Yes in (31), "Last week did ... have a job to start at a definite date
in the future?")

(33) Why was ... absent from work last week? (8 possible codes)

(36) Identical to (14) but pertaining to Unemployed instead of Employed individuals.

(40) In the past 4 weeks has ... looked for another job? Yes or No.

(41) What has ... done in the past 4 weeks to find another job? (8 possible codes, 1 to 3 different
codes in 1, 2, or 3 spaces).

(54) What was the main reason why ... left that job? (9 possible codes) in response to yes to
(50) has ... ever worked at a job or business (pert. to individuals permanently unable to
work) and questions (51) to (53) dealing with date of last job and part/full time status. (54)
is skipped if date of last job not too recent according to a pre-printed date in (52).

(76/77) Class of worker and whether or not same as previous month, with respect to main job (76)
and other job (77)


Survey Methodology, June 1985 9 


There are several types of weight adjustments available for inflation of the sample to
compensate for unit nonresponse, the most common being the inverse response rate defined by
the ratio of the sample size to the responding sample size in an adjustment cell. Thus, if
the cell contains $N_b$ units in its population and is represented by $n_b$ selected units, where:

$n_b = \sum_{i \in b} t_i$, the sample size in cell $b$, which may or may not be a constant,
depending on the definition of the cell,

$\hat{N}_b = \sum_{i \in b} \pi_i^{-1} t_i$, an estimate of the size of cell $b$ in the population (usually $N_b$
would not be known except in a census),

$m_b = \sum_{i \in b} t_i \delta_i$ = no. of responding units in cell $b$, i.e., the responding sample size,

then

$$(wa)_i = n_b / m_b \quad \text{when } i \text{ lies in adjustment cell } b. \tag{3.6}$$
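
A minimal Python sketch of the inverse unweighted response rate (3.6), computed cell by cell, is given below. It assumes NumPy; the function name, the cell labels and the data are hypothetical, and the decision to raise an error for a cell with no respondents is a choice made here (in practice such a cell would presumably be collapsed with another).

```python
import numpy as np

def inverse_response_rate(cell, delta):
    """Weight adjustment (3.6): n_b / m_b for each sampled unit's cell.

    cell  : adjustment-cell label of each sampled unit
    delta : 1 if the unit responded, 0 otherwise
    """
    cell = np.asarray(cell)
    delta = np.asarray(delta)
    wa = np.empty(len(cell), dtype=float)
    for b in np.unique(cell):
        in_b = cell == b
        n_b = in_b.sum()         # sample size in cell b
        m_b = delta[in_b].sum()  # responding sample size in cell b
        if m_b == 0:
            raise ValueError(f"cell {b!r} has no respondents")
        wa[in_b] = n_b / m_b
    return wa

cell = np.array(["a", "a", "a", "a", "b", "b", "b"])
delta = np.array([1, 1, 1, 0, 1, 0, 1])
wa = inverse_response_rate(cell, delta)
print(wa)
```

The adjustment is returned for every sampled unit in the cell, but in (2.2) it only matters for respondents since it is multiplied by $\delta_i$.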


Before defining other possible weight adjustments, we will concentrate on the frequently
applied inverse unweighted response rate in a cell as in (3.6). The estimate of the total
defined by (2.2) with $(wa)_i = n_b/m_b$ may be rewritten as a special case of (2.1), with $z_i$ given
by:

$$z_i = T_b / (\pi_i^{-1} m_b), \tag{3.7}$$

where $T_b = \sum_{i \in b} \pi_i^{-1} t_i \delta_i [\delta_{iy} y_i + (1 - \delta_{iy}) z_{iy}]$ is the sample weighted total of responding units in
cell $b$. In the case of equal sample weights in a cell, the imputed value $z_i$ simplifies to the
mean value of the $m_b$ respondents in the cell. By substituting $z_i$, given by (3.7), into (2.1), it
may be shown that the estimate is identical to (2.2) with $(wa)_i = n_b/m_b$. Thus, one may
regard imputation for unit nonresponse as a substitution of $z_i = T_b/(\pi_i^{-1} m_b)$ in (2.1) or as
a weight adjustment to the sample weights by $(wa)_i = n_b/m_b$ in (2.2). In the case of the
weight adjustment, one would set $z_i = 0$ in (3.4) in $\hat{Y}$ as split up into 5 components.
Alternatively, one may employ the imputed value $z_i$ as defined in (2.1) and, in that case, one
would set $(wa)_i = 1$ in (3.5), so that that component of $\hat{Y}$ equals 0. Thus, in order to
consider the effect of weight adjustment $(wa)_i > 1$, both the negative component (3.4) and
positive component (3.5) must be studied together; but to consider the effect of the implicit
imputed value $z_i$, given by (3.7), one needs only to consider (3.4).
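
The equivalence between the implicit imputed value (3.7) and the weight adjustment $n_b/m_b$ can be verified numerically. The Python sketch below assumes NumPy, a single adjustment cell with unequal selection probabilities, and hypothetical data in which `v` stands for the value $\delta_{iy} y_i + (1 - \delta_{iy}) z_{iy}$ already available for each respondent.

```python
import numpy as np

# One adjustment cell with five sampled units, two of them nonrespondents.
pi = np.array([0.1, 0.2, 0.25, 0.5, 0.1])
delta = np.array([1, 1, 0, 1, 0], dtype=float)
v = np.array([10.0, 20.0, 0.0, 40.0, 0.0])  # 0 is a placeholder

w = 1.0 / pi                   # sample weights pi_i^{-1}
n_b, m_b = len(pi), delta.sum()
T_b = np.sum(w * delta * v)    # weighted total of respondents in the cell

# Weight-adjustment form (2.2) with (wa)_i = n_b / m_b.
est_22 = (n_b / m_b) * T_b

# Substitution form (2.1) with z_i from (3.7).
z = T_b / (w * m_b)
est_21 = np.sum(w * (delta * v + (1 - delta) * z))

print(est_21, est_22)
```

Each nonrespondent contributes $\pi_i^{-1} z_i = T_b/m_b$ to the total regardless of its own weight, which is why the two forms agree even with unequal weights.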

The weight adjustment $(n_b/m_b)$ is used in LFS, where the adjustment cells are design-
dependent psu's in non-self-representing areas (NSR) and strata (subunits) of contiguous
city blocks in self-representing areas (SR). In Table 5, the number of cells, the unweighted
average of the weight adjustments and the frequency distribution of the weight adjustment
in intervals 1-1.01, 1.01-1.02, ..., 1.10 and over are given by region/type of area for the survey,
Jan. 1983.

The average weight adjustment of 1.0348 at Canada level is less than what one would 
expect with a nonresponse rate of about 5%. The reason for the apparent low average weight 
adjustment is that, for purposes of calculations of the inverse response rate, some unit 
nonrespondents with available responding data of the previous month for imputation pur- 
poses are treated like respondents. This applies to about 20 to 30% of the nonrespondents 
every month. 
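
The size of this effect can be gauged with a rough calculation. The Python sketch below is only an approximation under simplifying assumptions: it treats the whole sample as one cell with a nonresponse rate of exactly 5%, whereas Table 5 reports an unweighted average of cell-level ratios; the function name is hypothetical.

```python
def expected_weight_adjustment(nonresponse_rate, share_treated_as_respondents):
    """Inverse response rate when part of the nonrespondents count as respondents."""
    effective_rate = nonresponse_rate * (1.0 - share_treated_as_respondents)
    return 1.0 / (1.0 - effective_rate)

# No carry-forward of previous month's data, then 20% and 30% of
# nonrespondents treated like respondents.
for share in (0.0, 0.2, 0.3):
    print(share, round(expected_weight_adjustment(0.05, share), 4))
```

Under these assumptions the adjustment falls from about 1.053 with no such treatment to roughly 1.036 to 1.042, which is closer to the observed average of 1.0348.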

Table 5
Number of Adjustment Cells, Average and Frequency Distribution of the
Weight Adjustments by Region/Type of Area, January 1983

                              No. of cells in intervals of (wa)_i

Region/   No. of   Aver.    1-  1.01- 1.02- 1.03- 1.04- 1.05- 1.06- 1.07- 1.08- 1.09- 1.10+
Type of    Cells  (wa)_i  1.01   1.02  1.03  1.04  1.05  1.06  1.07  1.08  1.09  1.10
Area

Atl. NSR     254  1.0250   143     6    22    21    13    13     9     7     8     2    10
Atl. SR      123  1.0246    58     5    11    15    14     4     3     6     4     1     2
Que. NSR     126  1.0550    72     2     8    10    10     6     8     6     0     1     3
Que. SR      185  1.0265   106     0     7     8    23    11     4     5     7     3    11
Ont. NSR     120  1.0333    58     1    10    11    11     8     4     2     2     2    11
Ont. SR      252  1.0416   116     1    13    24    21    16     9     9     8    10    25
Pr. NSR      328  1.0348   167     5    17    22    23    24    15    12    10     8    25
Pr. SR       149  1.0306    40    23    23    20    13     8     7     3     5     4     3
BC NSR        85  1.0468    38     3     7     8     8     2     5     1     1     1    11
BC SR        119  1.0412    46     4     7    15    10     7     7     7     3     3    10
Can. NSR     913  1.0358   478    17    64    72    65    53    41    28    21    14    60
Can. SR      828  1.0337   366    33    61    82    81    46    30    30    27    21    51
Canada     1,741  1.0348   844    50   125   154   146    99    71    58    48    35   111


Without knowledge of the nonrespondents' characteristics, the threshold level beyond which
the weight adjustment would become critical, resulting in an unacceptable bias along with an
increase in the variance due to a smaller effective sample size, cannot be determined
precisely. If the threshold is arbitrarily set for LFS at 1.05 (a level sometimes assumed
by survey practitioners), then about 1/4 of the balancing units (441 out of 1,741) across Canada
had critical weight adjustments of 1.05 or more in Jan. 1983. In many other surveys, such
as those dealing with income and expenditure, the nonresponse rate is higher overall and
would likely be critical in nearly all cells if the same threshold of 1.05 is assumed.

There are other types of weight adjustments in cells. For example, one could exclude from
cell $b$, as defined above, those units that contain item nonresponse for at least one question.
Let us suppose there are $m_{b0}$ units in cell $b$ free of item nonresponse for the whole set of
questions on the questionnaire. For the $(m_b - m_{b0})$ responding units in the cell with some item
nonresponse, the weight adjustment $(wa)_i = 1$, and for the remaining $m_{b0}$ responding units, free
of item nonresponse, the weight adjustment is given by:

$$(wa)_i = [n_b - (m_b - m_{b0})] / m_{b0}, \quad \text{which exceeds } n_b/m_b. \tag{3.7a}$$


The following is the justification for applying no weight adjustment, i.e., $(wa)_i = 1$, for
those units in the cell with some item nonresponse but a larger weight adjustment (3.7a) than
$n_b/m_b$ for those units free of item nonresponse. Records with item nonresponse likely
contain response and imputation errors, while records free of item nonresponse contain only
response errors. With the larger weight applied to records free of item nonresponse, it
may be possible to obtain estimates with lower mean square error than by using the same
weight adjustment for all $m_b$ responding units in the cell. To our knowledge, weight
adjustments such as described above have not been applied, but they may be worthy of study
if the decrease in the bias offsets the increase in the variance that would occur with the
different weights.
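
A small numerical sketch of the adjustment (3.7a) follows. The Python function below is illustrative, with a hypothetical name and hypothetical counts; it returns the adjustment for respondents with some item nonresponse (always 1) and for respondents free of item nonresponse.

```python
def split_weight_adjustment(n_b, m_b, m_b0):
    """Adjustments of (3.7a) for a cell with n_b sampled units,
    m_b respondents and m_b0 respondents free of item nonresponse."""
    if not 0 < m_b0 <= m_b <= n_b:
        raise ValueError("require 0 < m_b0 <= m_b <= n_b")
    wa_with_item_nonresponse = 1.0
    wa_item_complete = (n_b - (m_b - m_b0)) / m_b0
    return wa_with_item_nonresponse, wa_item_complete

# Hypothetical cell: 10 sampled units, 8 respondents, 6 of them item-complete.
n_b, m_b, m_b0 = 10, 8, 6
wa_partial, wa_complete = split_weight_adjustment(n_b, m_b, m_b0)
print(wa_partial, wa_complete, n_b / m_b)
```

In this example the item-complete respondents receive 8/6, which exceeds the uniform adjustment 10/8, and the adjusted respondent counts still add up to the sample size $n_b$ of the cell.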

In the case of units with unequal probability sampling, there exists a weight adjustment
based on the weighted sample and responding units in a cell instead of the unweighted ones.
In such a case,

$$(wa)_i = \hat{N}_b / \hat{M}_b, \tag{3.8}$$

where $\hat{M}_b = \sum_{i \in b} \pi_i^{-1} t_i \delta_i$ is the sample weighted count of responding units in cell $b$. For the
analogous case to the weight adjustment $(wa)_i$ in (3.7a) applied only to responding units
free of item nonresponse,

$$(wa)_i = [\hat{N}_b - (\hat{M}_b - \hat{M}_{b0})] / \hat{M}_{b0}, \tag{3.9}$$

where $\hat{M}_{b0} = \sum_{i \in b} \pi_i^{-1} t_i \delta_i \prod_{q=1}^{Q} \delta_{iq}$ is the weighted count of responding units in cell $b$ free of
item nonresponse, and

$\delta_{iq}$ = 1 or 0 according as unit $i$ responded or did not respond to question no. $q$ of the
survey questionnaire containing $Q$ questions; thus, $\prod_{q=1}^{Q} \delta_{iq} = 1$ only if responding unit $i$
is free of item nonresponse.

The justification for using (3.9) in lieu of (3.8) may be similar to that for using (3.7a)
instead of (3.6). The justification for using weighted in place of unweighted response rates
needs explanation and is provided after Table 6.

One could derive separate $(wa)_i$ expressions as in (3.7a) or (3.9) for each question q or
for each characteristic y defined by a set of one or more questions. Unfortunately, one would
then be faced with different weight adjustments in an adjustment cell for different questions or
characteristics, resulting in inconsistencies among different characteristics in published tables.
In order to ensure uniform survey weights and weight adjustments, $(wa)_i$ should depend
only on the unit and not on the question or characteristic. One may, however, permit
imputations for some items while excluding them for other items, such as major ones, in the weight
adjustments (3.7a) or (3.9), as long as the inclusions and exclusions are consistent in the
adjustment cell. For example, one may treat a record whose missing item was imputed by logical
deduction rather than by hot decking as a record free of item nonresponse for weight
adjustment purposes.

For each of the above weight adjustments (3.6) to (3.9), it can be shown that (2.2)
is a particular case of (2.1) with $z_i$ given by a weighted or unweighted mean of respondents.
Thus, the implicit imputed value $z_i$ for nonresponding unit i for each of the four cases of
weight adjustments cited above is given by the expressions in Table (6). Additional notation
is required for the expressions as given below:


$$T_b = \sum_{i=1}^{N_b} t_i \pi_i^{-1} \delta_i [\delta_{iy} y_i + (1 - \delta_{iy}) z_i] \qquad (3.10)$$

= sample weighted total of unit respondents, including imputations for item nonresponse
but excluding weight adjustments by inverse unit response rate;


$$T_{by} = \sum_{i=1}^{N_b} t_i \pi_i^{-1} \delta_i \delta_{iy} y_i \qquad (3.11)$$

= sample weighted total of unit and item respondents with respect to characteristic y;


$$T_{bQy} = \sum_{i=1}^{N_b} t_i \pi_i^{-1} \delta_i \prod_{q=1}^{Q} \delta_{iq}\, y_i \qquad (3.12)$$

= sample weighted total of unit and item respondents with respect to characteristic y, but
excluding those records in the cell with imputation for any item nonresponse.


Whus, i, = ly See 


The weight adjustment $(n_b - m_b + m_{bQ})/m_{bQ} = 1 + (n_b - m_b)/m_{bQ}$ of (c) is at least as large as the
weight adjustment $n_b/m_b$ of (a), since $m_{bQ} \le m_b$ (see Table 6). Hence, for a given
response rate $m_b/n_b$ in a cell, one may anticipate a larger variance of an estimate using (c)
than one using (a). The larger variance may or may not counteract a potentially smaller
imputation bias in the overall mean square error. The same holds true in the case of applying
weighted response rates $(N_b - M_b + M_{bQ})/M_{bQ}$ in (d) as opposed to $N_b/M_b$ in (b), since
$M_{bQ} \le M_b$. When pps sampling is applied, the use of weighted vs. unweighted response
rates leads to another interesting result. It is shown in Platek and Gray (1983), p. 264-265
that, when the response and selection probabilities, i.e., $\alpha_i$ and $\pi_i$, are positively correlated,
the weight adjustments with weighted response rates will tend to be higher than those with
unweighted rates. Thus, under the condition of positive correlation between $\alpha_i$ and $\pi_i$,
$E(N_b/M_b) > E(n_b/m_b)$ and similarly, $E[(N_b - M_b + M_{bQ})/M_{bQ}] > E[(n_b - m_b + m_{bQ})/m_{bQ}]$,
where $E = E_1 E_2$ is the expected value over all possible samples of units and
subsamples of responding units as described by Platek and Gray (1983), p. 251.


Table 6
Implicit Imputed Value for Unit Nonrespondent by
Weight Adjustment (Cell Level)


      Weight adjustment                 Reference   Implicit imputed value          Description
                                        in text     when $\delta_i = 0$*

(a)   $n_b/m_b$                         (3.6)       $T_b/(\pi_i^{-1} m_b)$          Unweighted unit response rate
(b)   $N_b/M_b$                         (3.8)       $T_b/M_b$                       Weighted unit response rate
(c)   $(n_b - m_b + m_{bQ})/m_{bQ}$     (3.7a)      $T_{bQy}/(\pi_i^{-1} m_{bQ})$   Unweighted unit response rate among
                                                                                    units free of item nonresponse
(d)   $(N_b - M_b + M_{bQ})/M_{bQ}$     (3.9)       $T_{bQy}/M_{bQ}$                Weighted unit response rate among
                                                                                    units free of item nonresponse


Note: In the case of a self-weighting sample (srswor as a particular case), the implicit imputed value
$z_i$ becomes the simple mean of respondents for both cases (a) and (b), and the simple mean
of respondents (excluding those with some item nonresponse) in cases (c) and (d).

* See the Appendix for derivation.

Whatever the weight adjustment used to compensate for unit nonresponse, it is doubtful
that the individual implicitly imputed values $z_i$ would be close to the individual true values
$Y_i$ or even to the potential observed responses $y_i$. The best that can be achieved with the
weight adjustment is to hope that adjustment cells formed to compensate for missing data
due to unit nonresponse will ensure minimum differences between the characteristics of
respondents and nonrespondents in the cells. Thus, the formation and delineation of
adjustment cells is most crucial for compensation, regardless of the type of weight adjustment that
is applied.


7. FINAL REMARKS 


As seen in the sections above, there is no ready-made solution to missing data, whatever
the type that occurs. The initial strategy is to minimize the occurrence of missing data to the
extent possible, without incurring great cost or sacrificing the timeliness of the survey data.
Every attempt should be made at the outset to prepare for some nonresponse and to set up
imputation strategies. If missing data occur in about the manner anticipated, then the survey
data processing ought to proceed on schedule, with the appropriate substitutions or weight
adjustments. Clearly, the scheduling of survey data collection, publishing, etc. can proceed
in a more orderly fashion in continuous or repeated surveys than in ad hoc one-time surveys,
for which the survey designer may not realize, until after the fact, all the things that can
go wrong, such as unexpected refusals or lack of interest on the part of both interviewers
and respondents.

In order to deal with the nonresponse problems it is essential to maintain a continuous 
study of nonresponse rates by the survey characteristic (in the case of item nonresponse), 
reason for nonresponse, and if possible, to extend the study to an analysis of item and unit 
response probabilities so that imputation biases may be estimated from the survey itself. Alter- 
natively, model-based estimates may continue to be explored to examine the imputation bias 
and, furthermore, to strengthen the estimates by employing additional information. 


APPENDIX 
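Derivation of Implicit Value $z_i$ for Unit Nonresponse Imputation


In the case of (c) and (d) of Table 6, the estimate at the cell b level is given by:

$$\hat{Y}_b = T_b + [(wa)_i - 1] \sum_{i=1}^{N_b} t_i \pi_i^{-1} \delta_i \prod_{q=1}^{Q} \delta_{iq}\, y_i. \qquad (A.1)$$

In case (c), $(wa)_i - 1 = (n_b - m_b)/m_{bQ}$,


or $$\hat{Y}_b = T_b + \sum_{i=1}^{N_b} t_i \pi_i^{-1} (1 - \delta_i)\, T_{bQy}/(\pi_i^{-1} m_{bQ}) \qquad (A.2)$$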


or, by equating (A.2) to (A.1) and noting the definitions of $T_b$ in (3.10) and $\hat{Y}$ in (2.1), one may
see that the imputed value $z_i$ is given by $T_{bQy}/(\pi_i^{-1} m_{bQ})$, as stated in (c) of Table (6).

Similarly, when weighted response rates are employed, the implicit imputed value $z_i$ may
be found to be $T_{bQy}/M_{bQ}$, as in (d) of Table (6). The results for (a) and (b) of Table (6)
follow by setting $m_{bQ} = m_b$ and $M_{bQ} = M_b$.


Conditional Inference in Survey Sampling
J.N.K. RAO¹


ABSTRACT 


Conventional methods of inference in survey sampling are critically examined. The need for condi- 
tioning the inference on recognizable subsets of the population is emphasized. A number of real ex- 
amples involving random sample sizes are presented to illustrate inferences conditional on the realized 
sample configuration and associated difficulties. The examples include the following: estimation of 
(a) population mean under simple random sampling; (b) population mean in the presence of outliers; 
(c) domain total and domain mean; (d) population mean with two-way stratification; (e) population 
mean in the presence of non-responses; (f) population mean under general designs. The conditional 
bias and the conditional variance of estimators of a population mean (or a domain mean or total), 
and the associated confidence intervals, are examined. 


KEY WORDS: Conditional inference; Conditional bias; Conditional variance; Population mean; 
Random sample sizes 


1. INTRODUCTION 


In the conventional set-up for inference in survey sampling the sample design defines the
sample space S (set of possible samples s) and the associated probabilities of selection, p(s).
The choice of an estimator is based on the criterion of consistency or unbiasedness and on
the comparison of mean square errors (MSE), under repeated sampling with probabilities
p(s), using the sample space S as the reference set. Thus, an estimator $\hat{\bar{Y}}$ of a population
mean $\bar{Y}$ is unbiased if $E(\hat{\bar{Y}}) = \sum_{s \in S} p(s) \hat{\bar{Y}}_s = \bar{Y}$, where $\hat{\bar{Y}}_s$ is the value of $\hat{\bar{Y}}$ for the sample
s. The MSE of the estimator $\hat{\bar{Y}}$ is given by $MSE(\hat{\bar{Y}}) = \sum_{s \in S} p(s)(\hat{\bar{Y}}_s - \bar{Y})^2$, and $\hat{\bar{Y}}$ is
consistent if its MSE approaches zero as the sample size increases. A consistent or unbiased estimator
of $MSE(\hat{\bar{Y}})$, denoted as $mse(\hat{\bar{Y}})$, provides a measure of uncertainty in $\hat{\bar{Y}}$. If $\hat{\bar{Y}}$ is unbiased
or consistent, then the observed values $\hat{\bar{Y}}_s$ and $mse(\hat{\bar{Y}}_s)$ provide a large sample, $(1 - \alpha)$-level,
confidence interval given by


$$I_s: \hat{\bar{Y}}_s \pm z_{\alpha/2} \sqrt{mse(\hat{\bar{Y}}_s)}, \qquad (1)$$


where $z_{\alpha/2}$ is the upper $\alpha/2$-point of a N(0, 1) variable. The interpretation of (1) is that in
repeated sampling with S as the reference set, approximately $100(1 - \alpha)\%$ of the intervals,
$I_s$, will contain the true value $\bar{Y}$.

The comparison of unconditional mean square errors, $MSE(\hat{\bar{Y}})$, is appropriate at the design
stage, but the sample space S may not be the relevant reference set for inference after the
sample s has been drawn, if the sample contains "recognizable subsets". The concept of
recognizable subsets will be illustrated in subsequent sections through examples involving
random sample sizes. The choice of relevant reference set, however, is not unique. In fact,
the surveyed sample s can be viewed as unique in a real sense, but then no inference under
a repeated sampling set-up can be made since the relevant reference set would contain a
singleton (Holt and Smith 1979).

¹ J.N.K. Rao, Department of Mathematics and Statistics, Carleton University, Ottawa, Ontario, Canada K1S 5B6.

Conditional inference has attracted considerable attention and controversy in classical
statistics since Fisher (1925). For instance, in testing for independence in a 2 x 2 table of 
counts, Fisher argued that the inference should be conditional on the observed row and col- 
umn marginal totals even if the margins are not fixed by the design. Yates (1984) revived 
this problem. The choice of relevant reference set is not always clear-cut, but the following 
guidelines look reasonable: (1) A conditional procedure should be chosen before observing 
the data, especially in the public domain. (2) A conditioning partition of S should be chosen 
in such a way that the partition contains no (or little) information on the parameters of in- 
terest, i.e. the statistic indexing the partition should be an ancillary statistic (Cox and Hinkley 
1974, p. 38). (3) If the sample sizes are random (e.g., domain sample sizes) and their popula- 
tion distribution is completely known (or at least partially known), then the inferences should 
be conditional on the observed sample sizes. In this context, Durbin (1969, p. 643) says “If
the sample size is determined by a random mechanism and one happens to get a large sample
one knows perfectly well that the quantities of interest are measured more accurately than
they would have been if the sample size had happened to be small. It seems self-evident that
one should use the information available on sample size in the interpretation of the result.
To average over variations in sample size which might have occurred but did not occur, when
in fact the sample size is exactly known, seems quite wrong from the standpoint of the analysis
of the data actually observed”.

The discussion throughout the paper will be confined to conditional inference in the 
presence of random sample sizes, as in guideline (3) above. Even with this restriction, it will 
be shown that conditional inferences are not always easy to implement in practice. We begin 
our discussion with simple examples and then extend it to more complex problems. In the 
context of sample surveys, Holt and Smith (1979) provide the most compelling arguments 
in favour of conditional inference, although their discussion was restricted to poststratifica- 
tion of a simple random sample (SRS); see Section 3.1. 

Lahiri (1969) pointed out the “difficulties of conveying convincingly the real import of
the sample survey estimates to intelligent but lay users of statistical data”; in particular, “the
fallacy in implicitly using the (sampling) standard error as a measure of precision of the observed
(sample) estimate, illustrating this point with a number of examples drawn from the current
theory”.


2. SIMPLE RANDOM SAMPLING WITH REPLACEMENT 


Simple random sampling (SRS) with replacement is seldom used in practice, but it pro- 
vides a simple introduction to conditional inference. 

Suppose a simple random sample, s, of size n is selected from a population of size N with
replacement so that S contains $N^n$ samples s. Let $\nu$ denote the number of distinct units in
s. Then $\nu$ is a random variable with possible values 1, ..., n. Let $t_i$ denote the number of
times the i-th population unit is included in s. Then two well-known estimators of the
population mean $\bar{Y}$ are given by


$$\bar{y}_n = \frac{1}{n} \sum_{i \in s} t_i y_i, \qquad (2.1)$$


the sample mean based on all the n draws, and


$$\bar{y}_\nu = \frac{1}{\nu} \sum_{i \in s} y_i, \qquad (2.2)$$


the mean based on the distinct units in s. Both $\bar{y}_n$ and $\bar{y}_\nu$ are unconditionally unbiased under
the reference set S, and the unconditional variance of $\bar{y}_\nu$ is always smaller than that of $\bar{y}_n$.
Hence, from efficiency considerations $\bar{y}_\nu$ should be preferred over $\bar{y}_n$. The
Horvitz-Thompson estimator


$$\bar{y}_{HT} = \frac{1}{N} \sum_{i \in s} \frac{y_i}{\pi_i} \qquad (2.3)$$


is also unconditionally unbiased, where $\pi_i$ is the probability that unit i is included at least
once in the sample:


$$\pi_i = 1 - \left(1 - \frac{1}{N}\right)^n.$$


The comparison of variances of $\bar{y}_\nu$ and $\bar{y}_{HT}$ shows that $\bar{y}_\nu$ is not always better than $\bar{y}_{HT}$.

Following Durbin’s (1969) argument, it is clear that for the purpose of inference one should
condition on the observed value of $\nu$, i.e., the relevant reference set is the set $S_\nu$ of $\binom{N}{\nu}$
samples of effective size $\nu$, and not S. Fortunately, it is easy to implement conditional
inference in this case since $P(s_\nu | \nu) = \binom{N}{\nu}^{-1}$, i.e., conditionally, the observed sample, $s_\nu$, of
distinct units is a simple random sample of size $\nu$ drawn without replacement. It follows
that $\bar{y}_\nu$ is conditionally unbiased, i.e., $E_2(\bar{y}_\nu) = \bar{Y}$ where $E_2$ denotes conditional expectation,
whereas $E_2(\bar{y}_{HT}) = [\nu/E(\nu)]\bar{Y} \ne \bar{Y}$ so that $\bar{y}_{HT}$ is conditionally biased. Hence, $\bar{y}_\nu$ should be
preferred over $\bar{y}_{HT}$, despite the inconclusive comparison of unconditional variances. Note
that $\bar{y}_{HT}$ would be a serious underestimate if the observed $\nu$ is much smaller than $E(\nu)$.
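
These conditional expectations can be checked by brute force on a toy population. The sketch below is an illustration only: the population values and the function name are hypothetical, and the code simply enumerates all with-replacement samples, groups them by the number of distinct units, and averages the distinct-unit mean and the Horvitz-Thompson estimator within each group using exact rational arithmetic.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def conditional_expectations(y, n):
    """Enumerate all N**n with-replacement samples of size n from y.

    Returns (table, expected_nu), where table[nu] is the pair
    (conditional mean of the distinct-unit mean, conditional mean of the
    Horvitz-Thompson estimator) given nu distinct units, and expected_nu
    is E(nu) = N * pi with pi = 1 - (1 - 1/N)**n.
    """
    N = len(y)
    pi = 1 - Fraction(N - 1, N) ** n          # inclusion probability of any unit
    sums = defaultdict(lambda: [Fraction(0), Fraction(0), 0])
    for draws in product(range(N), repeat=n):  # each ordered sample is equally likely
        distinct = set(draws)
        nu = len(distinct)
        total = sum(Fraction(y[i]) for i in distinct)
        acc = sums[nu]
        acc[0] += total / nu                   # mean over distinct units
        acc[1] += total / (N * pi)             # Horvitz-Thompson estimator of the mean
        acc[2] += 1
    table = {nu: (a / c, b / c) for nu, (a, b, c) in sorted(sums.items())}
    return table, N * pi


# Hypothetical toy population with mean 4.
y = [1, 2, 4, 9]
table, expected_nu = conditional_expectations(y, n=3)
```

For every value of the number of distinct units the first entry equals the population mean, while the second equals the population mean multiplied by the ratio of the observed to the expected number of distinct units, in line with the argument above.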

A relevant measure of uncertainty is the conditional variance, $V_2(\bar{y}_\nu)$, which is estimated
unbiasedly by


$$v(\bar{y}_\nu) = \left(\frac{1}{\nu} - \frac{1}{N}\right) s_\nu^2, \qquad (2.4)$$


where $(\nu - 1)s_\nu^2 = \sum_{i \in s}(y_i - \bar{y}_\nu)^2$ and $V_2$ denotes the conditional variance. The appropriate
confidence interval for $\bar{Y}$ is given by


$$I_\nu: \bar{y}_\nu \pm z_{\alpha/2} \sqrt{v(\bar{y}_\nu)}. \qquad (2.5)$$


Conditionally, the confidence level of $I_\nu$ is $1 - \alpha$ approximately if $\nu$ is not small. Another
variance estimator


$$v^*(\bar{y}_\nu) = \left[E\left(\frac{1}{\nu}\right) - \frac{1}{N}\right] s_\nu^2 \qquad (2.6)$$


is conditionally biased, although unbiased when averaged over the whole sample space, S.
It follows from (2.4) and (2.6) that $v(\bar{y}_\nu) < v^*(\bar{y}_\nu)$ if $1/\nu < E(1/\nu)$ and vice versa if $1/\nu > E(1/\nu)$.
Thus, the confidence interval based on (2.6) would be too narrow if $E(1/\nu) < 1/\nu$
and hence yield a confidence level less than $1 - \alpha$, and too wide if $E(1/\nu) > 1/\nu$, leading
to a confidence level greater than $1 - \alpha$. It may be noted that confidence intervals that are
conditionally correct are automatically correct in the unconditional framework.


3. SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT


Suppose a simple random sample of fixed size n is drawn without replacement. In the
absence of recognizable subsets, the relevant reference set is the set S of $\binom{N}{n}$ samples s, each
of size n, and the sample mean $\bar{y}$ is unbiased and its variance is estimated unbiasedly by


$$v(\bar{y}) = \left(\frac{1}{n} - \frac{1}{N}\right) s^2, \qquad (3.1)$$


where $(n - 1)s^2 = \sum_{i \in s}(y_i - \bar{y})^2$. The resulting confidence interval is given by
$I: \bar{y} \pm z_{\alpha/2}\sqrt{v(\bar{y})}$, with confidence level $1 - \alpha$ approximately if n is not small.

Suppose now that recognizable subsets exist in the sense that we observe the sample
configuration $n = (n_1, ..., n_k)$ belonging to k post-strata with known weights $W_i = N_i/N$.
Ideally, stratified sampling should have been used but the strata frames were not available.
The relevant reference set now is the set $S_n$ of $\prod_i \binom{N_i}{n_i}$ samples having the realized configuration
n, since the distribution of n is completely known.


3.1 All $n_i \ge 1$


If all the observed $n_i \ge 1$, then the customary post-stratified estimator

$$\bar{y}_{pst} = \sum W_i \bar{y}_i \qquad (3.2)$$


is conditionally unbiased given n since $P(s|n) = \prod_i \binom{N_i}{n_i}^{-1}$, i.e., conditionally the observed
sample s is a stratified random sample $(s_1, ..., s_k)$ with strata sample sizes $n_i$. Here $\bar{y}_i$
denotes the sample mean in the i-th stratum. A relevant measure of uncertainty is the
conditional variance, $V_2(\bar{y}_{pst})$, which is estimated unbiasedly by


$$v(\bar{y}_{pst}) = \sum W_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) s_i^2, \qquad (3.3)$$


provided all $n_i \ge 2$, where $(n_i - 1)s_i^2 = \sum_{j \in s_i}(y_{ij} - \bar{y}_i)^2$ (Holt and Smith 1979). The
resulting confidence interval, $I_{pst}: \bar{y}_{pst} \pm z_{\alpha/2}\sqrt{v(\bar{y}_{pst})}$, is conditionally correct. Another
variance estimator


$$v^*(\bar{y}_{pst}) = \sum W_i^2 \left[E\left(\frac{1}{n_i}\right) - \frac{1}{N_i}\right] s_i^2 \qquad (3.4)$$


is conditionally biased, although unbiased when averaged over the whole sample space, S
(assuming that $P(n_i \le 1)$ is negligible). The conditional performance of the confidence
interval based on (3.4) evidently depends on the extent of divergence of the observed values $1/n_i$
from their expectations $E(1/n_i)$. It may be noted that the interval $I_{pst}$ is also correct in the
unconditional framework, provided $P(n_i \le 1)$ is negligible for all i.

If $n_i = 1$ for some i, no conditionally unbiased variance estimator can be obtained, but
it might be satisfactory to use a collapsed strata method or use the model-based solution
of Hartley et al. (1969), originally proposed for variance estimation in stratified random
sampling with one unit per stratum. Empirical studies might throw some light on the applicability
of the latter methods.


Survey Methodology, June 1985 19 


The customary justification for preferring $\bar{y}_{pst}$ over $\bar{y}$ is that the unconditional variance
of $\bar{y}_{pst}$ is approximately equal to the variance under proportional allocation and hence
smaller than the unconditional variance of $\bar{y}$. We are also reminded that gains in efficiency
under proportional allocation are likely to be modest. It is more important, however, to note
that the sample mean $\bar{y}$ is conditionally biased:


$$E_2(\bar{y}) = \sum w_i \bar{Y}_i \ne \sum W_i \bar{Y}_i = \bar{Y}, \qquad w_i = \frac{n_i}{n}, \qquad (3.5)$$


and hence the resulting inferences could be conditionally incorrect.
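
The contrast between the two estimators can be verified exactly on a small example. The following sketch is illustrative only: the five-unit population, its stratum labels, and the function name are hypothetical. It enumerates every without-replacement sample, groups the samples by their realized post-stratum counts, and averages the plain sample mean and the post-stratified mean within each group.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations


def conditional_means(y, stratum, n):
    """Exact conditional expectations, given the post-stratum sample counts.

    y       : population values
    stratum : post-stratum label of each population unit
    n       : fixed sample size (simple random sampling without replacement)
    Returns {counts: (E[sample mean | counts], E[post-stratified mean | counts])}
    for every configuration in which all post-strata are represented.
    """
    N = len(y)
    labels = sorted(set(stratum))
    N_h = {h: stratum.count(h) for h in labels}
    acc = defaultdict(lambda: [Fraction(0), Fraction(0), 0])
    for s in combinations(range(N), n):        # all samples are equally likely
        counts = tuple(sum(1 for i in s if stratum[i] == h) for h in labels)
        if min(counts) == 0:                   # skip samples with an empty post-stratum
            continue
        ybar = Fraction(sum(y[i] for i in s), n)
        ybar_pst = sum(
            Fraction(N_h[h], N) * Fraction(sum(y[i] for i in s if stratum[i] == h), c)
            for h, c in zip(labels, counts)
        )
        a = acc[counts]
        a[0] += ybar
        a[1] += ybar_pst
        a[2] += 1
    return {cfg: (a / k, b / k) for cfg, (a, b, k) in acc.items()}


# Hypothetical population: three small units and two large ones; mean is 36/5.
y = [1, 2, 3, 10, 20]
stratum = [0, 0, 0, 1, 1]
result = conditional_means(y, stratum, n=3)
```

In this toy case the post-stratified mean has conditional expectation 36/5 for both configurations, whereas the plain mean averages 19/3 when two small units are drawn and 32/3 when two large units are drawn.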


Example 1. Suppose k = 2 (say, male and female strata with known projected census weights
$W_1$ and $W_2 = 1 - W_1$, or small and big hospitals (Royall 1970)). Royall used a
super-population model


$$E_m(y_i) = \beta x_i, \quad i = 1, ..., N, \quad \beta > 0, \quad x_i > 0 \qquad (3.6)$$


to demonstrate that $\bar{y}$ is model-biased conditionally, where $E_m$ denotes the model
expectation, i.e.,


$$E_m(\bar{y}) = \beta\bar{x} \ne E_m(\bar{Y}) = \beta\bar{X} \qquad (3.7)$$


unless the sample mean $\bar{x}$ coincides with the population mean $\bar{X}$. In his example,
$x_i$ = number of beds in the i-th hospital, $y_i$ = number of occupied beds in the i-th hospital,
and $x_1, ..., x_N$ are known. Royall argues that $\bar{y}$ leads to serious underestimation if the
observed sample contains all (or mostly) small hospitals since $B_m(\bar{y}) = E_m(\bar{y}) - E_m(\bar{Y}) =
\beta(\bar{x} - \bar{X})$ and $\bar{x} \ll \bar{X}$. This point can also be illustrated in our conditional framework
without assuming a model. The ratio of the conditional bias of $\bar{y}$ to the population mean of large
hospitals, $\bar{Y}_2$, may be expressed as


$$\frac{B_2(\bar{y})}{\bar{Y}_2} = (W_1 - w_1)\delta = (w_2 - W_2)\delta, \qquad (3.8)$$


where $B_2(\bar{y}) = E_2(\bar{y}) - \bar{Y}$ denotes the conditional bias of $\bar{y}$, $\delta = (\bar{Y}_2 - \bar{Y}_1)/\bar{Y}_2$ and
$0 < \delta < 1$ since the population mean, $\bar{Y}_1$, of small hospitals is smaller than $\bar{Y}_2$. If $w_1 = 1$
(i.e., all small hospitals observed in the sample), then $E_2(\bar{y}) = \bar{Y}_1 \ll \bar{Y}$ and hence $\bar{y}$ is a
serious underestimate. Similarly, if $w_1 \gg W_1$ (i.e., mostly small hospitals observed), then
it follows from (3.8) that $\bar{y}$ would lead to serious underestimation.

In this example, one should use the post-stratified estimator $\bar{y}_{pst} = W_1\bar{y}_1 + W_2\bar{y}_2$, which
is conditionally unbiased unless $n_1 = 0$ or $n_2 = 0$. It might be preferable, in fact, to use
a post-stratified ratio estimator


$$\bar{y}_{r(pst)} = \frac{\bar{y}_{pst}}{\bar{x}_{pst}} \bar{X}, \qquad (3.9)$$


where $\bar{x}_{pst} = W_1\bar{x}_1 + W_2\bar{x}_2$ and $\bar{x}_i$ is the sample mean of x in the i-th stratum. The estimator
(3.9) is approximately unbiased conditionally and more efficient than $\bar{y}_{pst}$ if n is large.


Remark 1. In Royall’s example, one should, in fact, use a more efficient design than simple
random sampling since all the population x-values are known, e.g., stratified random sampling
under x-stratification and, perhaps, optimal allocation based on the x-values.

Remark 2. Royall justifies the use of the customary ratio estimator $\bar{y}_r = (\bar{y}/\bar{x})\bar{X}$ under his
model (3.6), but it cannot be justified in the conditional (repeated sampling) framework since
$\bar{y}_r$ is conditionally biased:


$$B_2(\bar{y}_r) \approx \frac{w_1\bar{Y}_1 + w_2\bar{Y}_2}{w_1\bar{X}_1 + w_2\bar{X}_2}\bar{X} - \bar{Y} \ne 0 \qquad (3.10)$$


unless $\bar{Y}_1/\bar{X}_1 = \bar{Y}_2/\bar{X}_2 = R$. In the extreme case of $w_1 = 1$, $B_2(\bar{y}_r) = \bar{X}(R_1 - R)$ where
$R_1 = \bar{Y}_1/\bar{X}_1$. Hence, $B_2(\bar{y}_r) \gtrless 0$ according as $R_1 \gtrless R$.


Remark 3. If the weight $W_1$ is unknown but $\bar{X}$ is known, we cannot implement either $\bar{y}_{pst}$
or $\bar{y}_{r(pst)}$. Royall suggests the use of $\bar{y}_r$ with inference conditional on the observed mean $\bar{x}$.
However, the choice $\bar{x}$ is somewhat arbitrary, and the conditional bias of $\bar{y}_r$ could be quite
large unless the model (3.6) is true, at least approximately.

If good prior information on $W_1$ is available, say $W_1^* \le W_1 \le W_1^{**}$ where $W_1^*$ and $W_1^{**}$
are known, then one could use the following "pseudo" post-stratified estimator of $\bar{Y}$:


$$\bar{y}^*_{pst} = \tilde{W}_1\bar{y}_1 + \tilde{W}_2\bar{y}_2, \qquad (3.11)$$


where $\tilde{W}_1 = w_1$ if $W_1^* \le w_1 \le W_1^{**}$, $= W_1^*$ if $w_1 < W_1^*$, $= W_1^{**}$ if $w_1 > W_1^{**}$, and
$\tilde{W}_2 = 1 - \tilde{W}_1$. The estimator $\bar{y}^*_{pst}$ and its ratio analogue should perform better conditionally
given $(n_1, n_2)$ than $\bar{y}$ and $\bar{y}_r$, although biased. Unconditionally, the MSE of $\bar{y}^*_{pst}$ should be
smaller than the MSE of $\bar{y}$, provided $W_1^* \le W_1 \le W_1^{**}$. One could also utilize a formal
Bayesian approach to estimate $W_1$ by specifying a prior distribution on $W_1$.


Example 2 (outliers). The problem of estimating a population mean $\bar{Y}$ in the presence of
outliers is similar to the hospital example above. Suppose the population is known to
contain a small fraction, $W_2$, of outliers (large observations) but $W_2$ is unknown, i.e.,
$W_1 \gg W_2$ and $\bar{Y}_2 \gg \bar{Y}_1$. Then, if the observed sample contains no outliers (i.e., $w_2 = 0$),
we would say that $\bar{y}$ is "far from the true value $\bar{Y}$" (Chinnappa 1976) and yet $\bar{y}$ is
(unconditionally) unbiased. The meaning of this statement follows from the fact that
$E_2(\bar{y}) = \bar{Y}_1 \ll \bar{Y}$, where $E_2$ is the conditional expectation as before.

On the other hand, we would say that $\bar{y}$ is a serious overestimate if the sample contains
outliers. This follows from (3.8) noting that $w_2 \gg W_2$ (since $W_2$ is very small). For
instance, if $N_2 = 1$ then $w_2 = 1/n \gg W_2 = 1/N$. In this situation, we are told to modify
the estimate $\bar{y}$ by reducing the weight attached to outliers in the sample. One suggestion is
to modify $\bar{y}$ by reducing the weight attached to outliers from 1/n to 1/N and adjusting the
weights for non-outliers such that the n weights sum to 1:


$$\bar{y}^* = \frac{N - n_2}{N}\bar{y}_1 + \frac{n_2}{N}\bar{y}_2. \qquad (3.12)$$


The conditional relative bias of $\bar{y}^*$ is given by


$$\frac{B_2(\bar{y}^*)}{\bar{Y}_2} = \left(w_2\frac{n}{N} - W_2\right)\delta, \qquad (3.13)$$

whereas $B_2(\bar{y})/\bar{Y}_2 = (w_2 - W_2)\delta$. If $w_2 n/N - W_2 < 0$, then


$$\left|w_2\frac{n}{N} - W_2\right| = W_2 - w_2\frac{n}{N} < w_2 - W_2 \quad \text{if } 2W_2 < w_2\left(1 + \frac{n}{N}\right).$$


The inequality $2W_2 < w_2(1 + n/N)$ should be satisfied since $w_2 \gg W_2$. If $w_2 n/N - W_2 > 0$,
then


$$\left|w_2\frac{n}{N} - W_2\right| = w_2\frac{n}{N} - W_2 < w_2 - W_2,$$

since $n < N$. Hence, the estimator $\bar{y}^*$ should have a smaller absolute value of conditional bias than $\bar{y}$.

The estimator $\bar{y}^*$ is essentially obtained from the post-stratified estimator $\bar{y}_{pst}$ by
pretending that $N_2 = n_2$. A more satisfactory solution can be obtained by gathering good prior
information on $W_2 (= 1 - W_1)$, say from census data, and then using the estimator $\bar{y}^*_{pst}$, or
the estimator based on a Bayes estimator of $W_2$.

Hidiroglou and Srinath (1981) derived the conditional bias and conditional and
unconditional MSE of $\bar{y}$, $\bar{y}^*$ and some other modifications of $\bar{y}$, but they did not compare the
conditional biases of $\bar{y}$ and $\bar{y}^*$ as above.
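
A small numerical sketch may help fix ideas about the outlier-adjusted estimator and the two relative conditional biases. The function names and all numbers below are hypothetical (a population of 1000 with 5 outliers, a sample of 50 containing one outlier, and a relative difference of 0.9 between the stratum means); they are assumptions for illustration, not values from any survey.

```python
def outlier_adjusted_mean(ybar_1, ybar_2, n_2, N):
    """Weight each sampled outlier by 1/N and spread the remaining weight
    over the non-outliers: ((N - n_2)/N) * ybar_1 + (n_2/N) * ybar_2."""
    return (N - n_2) / N * ybar_1 + n_2 / N * ybar_2


def relative_conditional_biases(w_2, W_2, n, N, delta):
    """Relative conditional biases (bias divided by the outlier-stratum mean).

    Plain sample mean:       (w_2 - W_2) * delta
    Outlier-adjusted mean:   (w_2 * n / N - W_2) * delta
    """
    plain = (w_2 - W_2) * delta
    adjusted = (w_2 * n / N - W_2) * delta
    return plain, adjusted


# Hypothetical setting: N = 1000, 5 outliers, n = 50 with one outlier sampled.
N, n, n_2 = 1000, 50, 1
W_2, w_2 = 5 / N, n_2 / n
plain, adjusted = relative_conditional_biases(w_2, W_2, n, N, delta=0.9)

# Hypothetical stratum sample means: 10 for non-outliers, 100 for the outlier.
ystar = outlier_adjusted_mean(ybar_1=10.0, ybar_2=100.0, n_2=n_2, N=N)
```

Here the plain mean has a positive relative conditional bias, while the adjusted estimator has a smaller, negative one, which matches the first of the two cases discussed above.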


3.2 $n_i = 0$ for Some i


If the total sample size, n, is small or if too many post-strata are chosen, then $n_i$ could be
zero for some i. The post-stratified estimator (3.2) in this case reduces to


$$\bar{y}'_{pst} = {\sum}' W_i\bar{y}_i, \qquad (3.14)$$


where $\sum'$ denotes summation over strata with nonzero $n_i$. The estimator (3.14) is
conditionally biased:


$$E_2(\bar{y}'_{pst}) = {\sum}' W_i\bar{Y}_i \ne \sum W_i\bar{Y}_i. \qquad (3.15)$$


It remains conditionally biased even under the strong assumption $\bar{Y}_i = \bar{Y}$ for all i, which
incidentally shows that $\bar{y}'_{pst}$ could lead to serious underestimation. It is also unconditionally
biased. One commonly used method to overcome these difficulties is to collapse similar strata
to ensure that $n_i > 0$ for all i in the reduced set of strata. Fuller (1966) proposed a more
efficient solution for the special case of k = 2 post-strata, but his framework is
unconditional in the sense that the probability, $P^*$, of $n_1 = 0$ given that either $n_1 = 0$ or $n_2 = 0$,
is brought into the picture. His estimator is given by


$$\bar{y}_F = \begin{cases} \sum W_i\bar{y}_i & \text{if } n_1 > 0,\ n_2 > 0, \\ W_2\bar{y}_2/P^* & \text{if } n_1 = 0, \\ W_1\bar{y}_1/\bar{P}^* & \text{if } n_2 = 0, \end{cases} \qquad (3.16)$$


where $\bar{P}^* = 1 - P^*$. The estimator $\bar{y}_F$ is conditionally unbiased given that either $n_1 = 0$
or $n_2 = 0$, but is conditionally biased given $(n_1, n_2)$, even in the case $\bar{Y}_1 = \bar{Y}_2 = \bar{Y}$.
An unconditionally unbiased estimator is given by


$$\bar{y}_D = \sum \frac{a_i}{E(a_i)} W_i\bar{y}_i \qquad (3.17)$$

(Doss et al., 1979), where $a_i = 1$ if at least one unit from stratum i is in the sample, = 0
otherwise, and $\bar{y}_i$ is defined as $\bar{Y}_i$ if $n_i = 0$ (note that $a_i\bar{y}_i = 0$ if $n_i = 0$ even though $\bar{Y}_i$ is
unknown). The estimator $\bar{y}_D$, however, is conditionally biased since

$$E_2(\bar{y}_D) = \sum \frac{a_i}{E(a_i)} W_i\bar{Y}_i \ne \bar{Y}.$$

It remains conditionally biased even if $\bar{Y}_i = \bar{Y}$ for all i.

Doss et al. criticized $\bar{y}_D$ on the grounds that it is not translation-invariant (i.e., $\bar{y}_D$ does
not change to $\bar{y}_D + c$ when each $y_i$ is changed to $y_i + c$, where c is an arbitrary constant),
and hence that the variance of $\bar{y}_D$, when $y_i$ is changed to $y_i + c$, can be made arbitrarily
large by increasing c sufficiently. On the other hand, the ratio estimator


$$\bar{y}_{rD} = \frac{\sum [a_i/E(a_i)] W_i\bar{y}_i}{\sum [a_i/E(a_i)] W_i}, \qquad (3.18)$$

proposed by Doss et al., is translation-invariant. It is conditionally biased, but the
conditional bias is approximately zero if $\bar{Y}_i = \bar{Y}$ for all i, unlike the conditional bias of $\bar{y}_D$.
Another ratio estimator which is similar to $\bar{y}_{rD}$ conditionally is given by


$$\bar{y}_{r(pst)} = \frac{\sum a_i W_i\bar{y}_i}{\sum a_i W_i}, \qquad (3.19)$$


but it is inconsistent unconditionally, unlike $\bar{y}_{rD}$. Hence, $\bar{y}_{rD}$ may be preferred to $\bar{y}_{r(pst)}$ or $\bar{y}_D$.
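
The translation-invariance argument can be illustrated with a short computation. The sketch below makes several assumptions that are not in the text: the strata sizes, the sample means, and the function names are hypothetical, and the probability that a stratum is represented is computed for simple random sampling without replacement as one minus the hypergeometric probability of drawing no unit from that stratum. Empty strata are marked with `None`.

```python
from math import comb


def prob_stratum_sampled(N, N_i, n):
    """P(at least one unit from a stratum of size N_i appears in an SRS of
    size n drawn without replacement from N units)."""
    return 1 - comb(N - N_i, n) / comb(N, n)


def doss_estimators(W, ybar, E_a):
    """Return the unbiased estimator (3.17) and its ratio form (3.18).

    W    : stratum weights
    ybar : stratum sample means, None for strata with no sampled unit
    E_a  : probabilities that each stratum is represented in the sample
    """
    coef = [(w / e, yb) for w, yb, e in zip(W, ybar, E_a) if yb is not None]
    numerator = sum(c * yb for c, yb in coef)
    y_D = numerator
    y_rD = numerator / sum(c for c, _ in coef)
    return y_D, y_rD


# Hypothetical population of 20 units in three post-strata; sample size 4.
N, n = 20, 4
N_i = [10, 6, 4]
W = [N_h / N for N_h in N_i]
E_a = [prob_stratum_sampled(N, N_h, n) for N_h in N_i]
ybar = [3.0, 5.0, None]          # third post-stratum not represented

shift = 100.0
base = doss_estimators(W, ybar, E_a)
moved = doss_estimators(W, [None if v is None else v + shift for v in ybar], E_a)
```

Adding the constant to every observation moves the ratio form by exactly that constant, whereas the unbiased form moves by a different amount that depends on which strata happened to be sampled.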

If concomitant information on all strata is available, then one could fit a model to the
observed strata means $\bar{y}_i$ and predict the population means of strata with $n_i = 0$. For
example, if the population means $\bar{X}_i$ of a concomitant variable are linearly related to the
corresponding $\bar{Y}_i$, then the predicted value of a $\bar{Y}_i$ is given by $\hat{\alpha} + \hat{\beta}\bar{X}_i = \bar{y}_i^*$ (say), where $\hat{\alpha}$
and $\hat{\beta}$ are the least squares estimators obtained by minimising $\sum'(\bar{y}_i - \alpha - \beta\bar{X}_i)^2$. The
resulting estimator of $\bar{Y}$ is given by


$${\sum}' W_i\bar{y}_i + {\sum}'' W_i\bar{y}_i^*, \qquad (3.20)$$


where $\sum''$ denotes summation over strata with $n_i = 0$. This estimator should have good
conditional properties if the fitted model is adequate. It should be clear from this discussion
that there is no simple solution if $n_i = 0$ for some of the strata.


4. TWO-WAY STRATIFICATION 


Ingenious designs to improve the efficiency of estimators have been proposed in the
literature. Bryant et al. (1960) proposed a design involving two-way stratification in which
the sample sizes $n_{ij}$ are zero for some strata (cells). Their method is supposed to permit
estimation of the population mean even when the total sample size n is less than the total
number of strata. Using proportional allocation for the marginal sample sizes $(n_{i.}, n_{.j})$, they
obtained a random allocation $n_{ij}$ such that $E(n_{ij}) = n_{i.}n_{.j}/n = nW_{i.}W_{.j}$, where $W_{i.}$ and $W_{.j}$
are the row and column marginal totals of cell weights $W_{ij}$.

Bryant et al. proposed the estimator


$$\bar{y}_u = \frac{1}{n}\sum_i\sum_j n_{ij}G_{ij}\bar{y}_{ij}, \qquad (4.1)$$

where $G_{ij} = n^2W_{ij}/(n_{i.}n_{.j})$ and $\bar{y}_{ij}$ may be taken as $\bar{Y}_{ij}$ if $n_{ij} = 0$. The estimator $\bar{y}_u$ is
unconditionally unbiased. However, the distribution of $n_{ij}$ is completely known (since all $W_{ij}$ are
known) and hence the relevant reference set is the set of samples having the observed
configuration $\{n_{ij}\}$, i.e., one should treat the design as stratified simple random sampling for
inference purposes. The estimator $\bar{y}_u$ is conditionally biased:


$$E_2(\bar{y}_u) = \sum_i\sum_j \frac{n_{ij}G_{ij}}{n}\bar{Y}_{ij} \ne \bar{Y},$$


noting that $E_2(\bar{y}_{ij}) = \bar{Y}_{ij}$ if $n_{ij} > 0$. It also has the defects of $\bar{y}_D$ in the previous section, which
can be circumvented by using the ratio estimator


$$\bar{y}_r = \frac{\bar{y}_u}{d_u} = \frac{\sum_i\sum_j n_{ij}G_{ij}\bar{y}_{ij}}{\sum_i\sum_j n_{ij}G_{ij}}, \qquad (4.2)$$


where $d_u = \sum_i\sum_j n_{ij}G_{ij}/n$. The estimator $\bar{y}_r$ is also conditionally biased, but the conditional bias is
approximately zero if $\bar{Y}_{ij} = \bar{Y}$ for all (i, j). The latter condition, however, may be unrealistic in
the present context since the strata are different by design.

As in Section 3.1, it seems necessary to use a model connecting the sampled and
non-sampled strata. A reasonable model, in the absence of concomitant information, is to assume
that


$$y_{ijk} = \mu + \beta_i + \tau_j + e_{ijk}, \quad k = 1, ..., n_{ij}, \qquad (4.3)$$


where $y_{ijk}$ is the k-th observation in the (i, j)-th cell, $\beta_i$ and $\tau_j$ are fixed effects and $e_{ijk}$ are
independent errors with zero mean and common variance $\sigma^2$. Unfortunately, the linear
combination $\mu + \beta_i + \tau_j$ for nonsampled strata is not estimable from sample data and hence
the corresponding $\bar{Y}_{ij}$ cannot be predicted. This difficulty can be avoided by assuming that
$\beta_i$ and $\tau_j$ are random variables and then obtaining a predictor $\hat{\mu} + \hat{\beta}_i + \hat{\tau}_j$, but the random
effects model may be less realistic than (4.3) in the present context.

Motivated by the above-mentioned difficulty, Bankier (1985) discussed a raking procedure 
in the context of independent stratified samples according to two different criteria of stratifica- 
tion. His estimator is approximately model-unbiased under the fixed effects model (4.3), while 
the usual Horvitz-Thompson estimator and its ratio extension are model-biased. 

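Bankier’s method can be adapted to the two-way stratification problem. The raking ratio
estimator of $\bar{Y}$ is given by


$$\hat{\bar{Y}}(p) = \sum_i \sum_j \frac{G_{ij}(p)}{n}\, y_{ij} \qquad (4.4)$$


where $y_{ij}$ is the sample total in the $(i, j)$-th cell ($y_{ij} = 0$ if $n_{ij} = 0$) and
$G_{ij}(p)$ are the values obtained in the p-th iteration of the raking procedure such that


$$\sum_j \frac{G_{ij}(p)\, n_{ij}}{n} \doteq W_{i\cdot} \quad \text{and} \quad \sum_i \frac{G_{ij}(p)\, n_{ij}}{n} \doteq W_{\cdot j}. \qquad (4.5)$$


The $G_{ij}(p)$ are obtained as follows: Let $G_{ij}(0) = G_{ij} > 0$ for all $(i, j)$, and


$$G_{ij}(p) = \begin{cases} G_{ij}(p-1)\, \dfrac{W_{i\cdot}}{\sum_j G_{ij}(p-1)\, n_{ij}/n} & \text{if } p \text{ is odd} \\[2ex] G_{ij}(p-1)\, \dfrac{W_{\cdot j}}{\sum_i G_{ij}(p-1)\, n_{ij}/n} & \text{if } p \text{ is even.} \end{cases} \qquad (4.6)$$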

Under the fixed effects model (4.3), we have


$$E[\hat{\bar{Y}}(p)] \doteq \mu + \sum_i W_{i\cdot}\,\beta_i + \sum_j W_{\cdot j}\,\tau_j = E\Big(\sum_i \sum_j W_{ij}\,\bar{Y}_{ij}\Big) = E(\bar{Y}),$$


i.e. $\hat{\bar{Y}}(p)$ is approximately model-unbiased. Since $E(G_{ij}(0)\, n_{ij}/n) = W_{ij}$
for the choice $G_{ij}(0) = G_{ij}$, these starting values should be good. However, we may
encounter convergence problems with the raking process because of the many empty cells
($n_{ij} = 0$) resulting from the Bryant et al. design. We hope to investigate these convergence
problems as well as the conditional properties of the raking ratio estimator (4.4) in a
separate paper.
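
The raking iteration can be sketched numerically. The following is only an illustrative sketch, assuming that odd iterations rescale to the row margins and even iterations to the column margins, and that cells with a zero fitted margin are left unchanged; the function names, the starting values and the small 2 x 2 table are hypothetical and not taken from the paper.

```python
import numpy as np

def rake_weights(n_ij, w_row, w_col, g0, iterations=100):
    """Alternately rescale G_ij so that sum_j G_ij n_ij / n matches the row
    margins (odd steps) and sum_i G_ij n_ij / n matches the column margins
    (even steps)."""
    n_ij = np.asarray(n_ij, dtype=float)
    w_row = np.asarray(w_row, dtype=float)
    w_col = np.asarray(w_col, dtype=float)
    g = np.array(g0, dtype=float)
    n = n_ij.sum()
    for p in range(1, iterations + 1):
        cell = g * n_ij / n  # current fitted cell weights
        if p % 2 == 1:
            margin = cell.sum(axis=1)
            # assumption: rows with zero fitted margin keep their G values
            factor = np.divide(w_row, margin, out=np.ones_like(margin), where=margin > 0)
            g = g * factor[:, None]
        else:
            margin = cell.sum(axis=0)
            factor = np.divide(w_col, margin, out=np.ones_like(margin), where=margin > 0)
            g = g * factor[None, :]
    return g

def raking_ratio_mean(y_tot, n_ij, g):
    """Sum over cells of G_ij(p) * y_ij / n, with y_ij the cell sample totals."""
    n = np.asarray(n_ij, dtype=float).sum()
    return float((np.asarray(g, dtype=float) * np.asarray(y_tot, dtype=float)).sum() / n)

# Hypothetical 2 x 2 example with no empty cells
n_ij = np.array([[3, 1], [2, 4]])
w_row = np.array([0.5, 0.5])
w_col = np.array([0.3, 0.7])
g = rake_weights(n_ij, w_row, w_col, g0=np.ones((2, 2)))
fitted = g * n_ij / n_ij.sum()
```

With empty cells the same loop may fail to reproduce both sets of margins, which is the convergence problem mentioned above.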

If the population means $\bar{X}_{ij}$ of a concomitant variable x are known for all strata, then
one could fit a model to the observed strata means $\bar{y}_{ij}$, as in Section 3.1. For example,
the model $\bar{y}_{ij} = \beta \bar{X}_{ij} + b_i + t_j + \bar{e}_{ij}$ with random effects $b_i$
and $t_j$ might be reasonable, where $\bar{e}_{ij}$ is the sample mean of the errors $e_{ijk}$ in the
$(i, j)$-th cell. A predictor $\hat{\beta}\bar{X}_{ij} + \hat{b}_i + \hat{t}_j$ of $\bar{Y}_{ij}$
for nonsampled strata may be used in conjunction with the observed means $\bar{y}_{ij}$ to arrive
at an estimator of $\bar{Y}$. This approach is similar to modelling for small area estimates,
except that the parameter of interest here is the overall mean $\bar{Y}$ rather than the
individual cell means $\bar{Y}_{ij}$. We hope to investigate the conditional properties of
alternative estimators of $\bar{Y}$ in a separate paper.


5. NONRESPONSE 


5.1 A Simple Model 


Suppose m responses are obtained in a simple random sample of size n. Let $W_1$ denote
the proportion in the response stratum and $\bar{Y} = W_1\bar{Y}_1 + W_2\bar{Y}_2$ the population
mean, where $\bar{Y}_1$ and $\bar{Y}_2$ are the means of the response and nonresponse strata
respectively, and $W_2 = 1 - W_1$. In this situation, conditioning on the observed value of m can
be questioned since the distribution of m depends on the unknown $W_1$, which is involved in the
parameter of interest. Also, the sample mean $\bar{y}_m$ of respondents is unconditionally biased
because $E(\bar{y}_m) = \bar{Y}_1 \neq \bar{Y}$. Hence, it is necessary to assume a model for the
response mechanism even in the unconditional framework, unless a subsample of nonrespondents is
also sampled.




A simple model assumes that the probability of response if contacted is the same for all units,
say $p^*$, i.e., data are missing at random. Under this model, the distribution of m depends
only on $p^*$, and hence we should condition on m if $p^*$ is assumed known (or at least partially
known or unrelated to $\bar{Y}$). Oh and Scheuren (1983) have shown that, conditionally given m,
the sample $s_m$ of respondents is like a simple random sample of size m from the whole
population. Hence, $\bar{y}_m$ is conditionally unbiased, and its conditional variance is
unbiasedly estimated by


$$v_2(\bar{y}_m) = (m^{-1} - N^{-1})\, s_{my}^2, \qquad (5.1)$$


where $(m - 1)s_{my}^2 = \sum_{i \in s_m} (y_i - \bar{y}_m)^2$. The resulting confidence interval
$\bar{y}_m \pm z_{\alpha/2}\sqrt{v_2(\bar{y}_m)}$ is conditionally correct, at least
approximately, if m is not small.
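
As a small numerical illustration of the conditional variance estimator and the associated interval, the sketch below treats the m respondents as a simple random sample from a population of size N. The function name, the default normal quantile 1.96 and the data are illustrative assumptions only.

```python
import math

def respondent_mean_ci(y_resp, N, z=1.96):
    """Respondent mean, conditional variance estimate (1/m - 1/N) s^2,
    and the normal-theory interval mean +/- z * sqrt(variance)."""
    m = len(y_resp)
    if m < 2:
        raise ValueError("at least two respondents are needed to estimate the variance")
    ybar = sum(y_resp) / m
    s2 = sum((y - ybar) ** 2 for y in y_resp) / (m - 1)
    v2 = (1.0 / m - 1.0 / N) * s2
    half_width = z * math.sqrt(v2)
    return ybar, v2, (ybar - half_width, ybar + half_width)

# Hypothetical data: four respondents from a population of 100 units
ybar_m, v2_m, interval = respondent_mean_ci([2.0, 4.0, 6.0, 8.0], N=100)
```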
On the other hand, the Horvitz-Thompson estimator ($p^*$ known):


$$\bar{y}_{HT} = \frac{m}{E(m)}\,\bar{y}_m = \sum_{i \in s_m} \frac{y_i}{n p^*} \qquad (5.2)$$


is conditionally biased, as in Section 2, although unbiased when averaged over the distribution
of m. For general designs, the ratio estimator


$$\hat{\bar{Y}}_r = \frac{\sum_{i \in s_m} y_i/(\pi_i p_i^*)}{\sum_{i \in s_m} 1/(\pi_i p_i^*)} \qquad (5.3)$$


is often used on grounds of efficiency, where $\pi_i$ is the probability of inclusion and
$p_i^*$ is the probability of response if contacted (assumed known) for the i-th unit. In the
simple case of $p_i^* = p^*$ and simple random sampling, it is interesting to note that
$\hat{\bar{Y}}_r$ reduces to $\bar{y}_m$. Hence, the ratio estimator might perform well in a
conditional framework, for general designs.
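
The reduction of the ratio estimator to the respondent mean can be checked numerically. This is a minimal sketch assuming equal inclusion probabilities n/N and a common, known response probability; the function names and numbers are hypothetical.

```python
def horvitz_thompson_mean(y_resp, n, p_star):
    """Sum of respondent values divided by the expected number of respondents n * p_star."""
    return sum(y_resp) / (n * p_star)

def ratio_mean(y_resp, pi, p_resp):
    """Weighted respondent mean with weights 1 / (pi_i * p_i)."""
    weights = [1.0 / (a * b) for a, b in zip(pi, p_resp)]
    return sum(w * y for w, y in zip(weights, y_resp)) / sum(weights)

# Hypothetical case: n = 6 sampled from N = 100, m = 4 respondents, p* = 0.8
y_resp = [2.0, 4.0, 6.0, 8.0]
ht = horvitz_thompson_mean(y_resp, n=6, p_star=0.8)
ratio = ratio_mean(y_resp, pi=[0.06] * 4, p_resp=[0.8] * 4)
respondent_mean = sum(y_resp) / len(y_resp)
```

Here the ratio estimator coincides with the respondent mean, while the Horvitz-Thompson value differs from it whenever m is not equal to its expectation.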


5.2 A More Realistic Model 


A more realistic model assumes that data are missing at random within post-strata with
known weights $W_i$. Let $n_i$ and $m_i$ respectively denote the sample size and the respondent
sample size in the i-th post-stratum. Then the joint distribution of $(n_i, m_i)$ depends only on
the $W_i$ and the response probabilities within post-strata. Hence, we should condition on the
observed value of $(n_i, m_i)$ provided the post-stratum response probabilities are either known
or unrelated to the parameters of interest, viz., the post-strata means. Conditionally, the
observed sample is like a stratified simple random sample with fixed strata sizes $m_i$ (Oh and
Scheuren 1983). Hence, the estimator


$$\bar{y}_{pst} = \sum_i W_i\, \bar{y}_{mi} \qquad (5.4)$$


is conditionally unbiased, and its conditional variance is unbiasedly estimated by


$$v_2(\bar{y}_{pst}) = \sum_i W_i^2\,(m_i^{-1} - N_i^{-1})\, s_{mi}^2, \qquad (5.5)$$


where $\bar{y}_{mi}$ and $s_{mi}^2$ are the mean and variance of sample respondents in the i-th
post-stratum, respectively.
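
A minimal sketch of the post-stratified respondent estimator and its conditional variance estimate follows. It assumes the usual stratified simple random sampling form, with post-stratum population sizes N_i supplied by the user; the function name and data are hypothetical.

```python
def poststratified_respondent_mean(strata):
    """strata: list of (W_i, N_i, respondent values in post-stratum i).
    Returns (estimate, variance estimate), treating respondents as a
    stratified simple random sample with fixed sizes m_i."""
    estimate = 0.0
    variance = 0.0
    for W_i, N_i, values in strata:
        m_i = len(values)
        if m_i < 2:
            raise ValueError("each post-stratum needs at least two respondents")
        mean_i = sum(values) / m_i
        s2_i = sum((y - mean_i) ** 2 for y in values) / (m_i - 1)
        estimate += W_i * mean_i
        variance += W_i ** 2 * (1.0 / m_i - 1.0 / N_i) * s2_i
    return estimate, variance

# Hypothetical post-strata
est, var = poststratified_respondent_mean([
    (0.4, 40, [1.0, 3.0]),
    (0.6, 60, [10.0, 12.0, 14.0]),
])
```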

If the $W_i$ are unknown, it is a common practice to replace $W_i$ in (5.4) by its estimate
$w_i = n_i/n$. In this case, conditional inference can be questioned since the distribution of
$(n_i, m_i)$ depends on the unknown weights $W_i$ and since the $W_i$ are involved in the parameter
$\bar{Y} = \sum W_i \bar{Y}_i$. If partial information on $W_i$, in the form of bounds on $W_i$, is
available, we can proceed with conditional inference as in Example 1, Remark 3, although the
resulting estimator is still conditionally biased (but likely to be better than (5.4) with $W_i$
replaced by $w_i$).


6. DOMAIN ESTIMATION (SRS) 


6.1 Domain mean 


Under simple random sampling (SRS), the usual estimator of a subpopulation (domain)
mean, $\bar{Y}_i$, is given by the sample mean


$$\bar{y}_i = \frac{1}{n_i}\sum_{k \in s_i} y_k \qquad (6.1)$$


where $s_i$ is the sample falling in the domain and $n_i$ is the corresponding size.

If the domain size, $N_i$, is known, then one should condition on the observed value, $n_i$.
The estimator $\bar{y}_i$ is conditionally unbiased if $n_i > 0$ since conditionally $s_i$ is a
simple random sample of fixed size $n_i$ from the domain. An unbiased estimate of the conditional
variance is


$$v(\bar{y}_i) = (n_i^{-1} - N_i^{-1})\, s_i^2, \quad n_i > 0 \qquad (6.2)$$


and the resulting confidence interval $\bar{y}_i \pm z_{\alpha/2}\sqrt{v(\bar{y}_i)}$ is
conditionally correct.

The estimator $\bar{y}_i$, however, is unstable for small domains (small areas) with small $n_i$.
Also $\bar{y}_i$ is not defined if $n_i = 0$. One solution to the latter problem, suggested in the
literature, is to use a modified estimator


$$\bar{y}_i' = \frac{a_i}{E(a_i)}\,\bar{y}_i \qquad (6.3)$$


where $a_i = 1$ if $n_i \ge 1$; $a_i = 0$ if $n_i = 0$, and $\bar{y}_i$ is taken as $\bar{Y}_i$ if
$n_i = 0$. The estimator $\bar{y}_i'$, however, is conditionally biased:


$$E_2(\bar{y}_i') = \frac{a_i}{E(a_i)}\,\bar{Y}_i.$$


It is an underestimate if $n_i = 0$, and an overestimate if $n_i \ge 1$, although unconditionally
unbiased. The extent of overestimation depends on the magnitude of $E(a_i) = P(n_i \ge 1)$.
If, for example, $P(n_i \ge 1) = 0.75$, then $E_2(\bar{y}_i') = (4/3)\bar{Y}_i$ if $n_i \ge 1$.
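
To see how large the inflation factor can become for small domains, the sketch below computes P(n_i >= 1) exactly under simple random sampling without replacement and the resulting multiplier a_i / E(a_i). The function names and the population figures are hypothetical.

```python
from math import comb

def prob_domain_sampled(N, N_i, n):
    """P(n_i >= 1) when n units are drawn by SRS without replacement
    from N units, of which N_i belong to the domain."""
    return 1.0 - comb(N - N_i, n) / comb(N, n)

def conditional_expectation_factor(N, N_i, n, n_i):
    """Multiplier a_i / E(a_i) applied to the domain mean in the
    conditional expectation of the modified estimator."""
    a_i = 1 if n_i >= 1 else 0
    return a_i / prob_domain_sampled(N, N_i, n)

# Hypothetical small domain: 2 of 10 units, sample of 3
p_sampled = prob_domain_sampled(N=10, N_i=2, n=3)
factor_if_sampled = conditional_expectation_factor(10, 2, 3, n_i=1)
factor_if_empty = conditional_expectation_factor(10, 2, 3, n_i=0)
```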

Särndal (1984) proposed the following estimator in the context of small area estimation:


$$\bar{y}_{iS} = \bar{y} + \frac{w_i}{W_i}(\bar{y}_i - \bar{y}), \quad n_i \ge 0, \qquad (6.4)$$


where $\bar{y} = \sum w_i \bar{y}_i$ is the overall sample mean and $w_i = n_i/n$. The estimator is
approximately unconditionally unbiased, but conditionally biased unless $w_i = W_i$:


$$B_2(\bar{y}_{iS}) = \Big(\frac{w_i}{W_i} - 1\Big)(\bar{Y}_i - \bar{Y}'), \qquad (6.5)$$


where $\bar{Y}' = \sum w_i \bar{Y}_i$. If $n_i = 0$, the estimator $\bar{y}_{iS}$ reduces to the
"synthetic" estimator $\bar{y}$. The extent of under- (or over-) estimation of $\bar{y}_{iS}$
depends on both $w_i/W_i - 1$ and $\bar{Y}_i - \bar{Y}'$ and is hence more complex to analyse than
the bias of $\bar{y}_i'$. However, $\bar{y}_{iS}$ would have a larger absolute conditional bias*
than $\bar{y}$ if $w_i > 2W_i$ (and hence a larger conditional MSE). Also, the conditionally
unbiased estimator $\bar{y}_i$ has a smaller conditional variance than $\bar{y}_{iS}$ if
$w_i > W_i$ (neglecting the variance of $\bar{y}$ relative to that of $\bar{y}_i$) and hence
smaller conditional MSE.

Hidiroglou and Särndal (1985) proposed a modification of $\bar{y}_{iS}$:


$$\bar{y}_{iS}^{**} = \begin{cases} \bar{y}_i & \text{if } w_i \ge W_i \\[1ex] \bar{y}_{iS}^{*} = \bar{y} + \Big(\dfrac{w_i}{W_i}\Big)^2 (\bar{y}_i - \bar{y}) & \text{if } w_i < W_i. \end{cases} \qquad (6.6)$$


The estimator $\bar{y}_{iS}^{**}$ is conditionally unbiased if $w_i \ge W_i$, while its conditional
absolute bias is smaller than that of $\bar{y}$ if $w_i < W_i$. A motivation for
$\bar{y}_{iS}^{**}$ is that the conditional variance of $\bar{y}_{iS}^{*}$ (or $\bar{y}_{iS}$) is
larger than that of $\bar{y}_i$ (neglecting the variance of $\bar{y}$ relative to that of
$\bar{y}_i$) if $w_i > W_i$, while the conditional variance of $\bar{y}_{iS}^{*}$ is smaller than
that of $\bar{y}_{iS}$ if $w_i < W_i$. However, the absolute conditional bias of
$\bar{y}_{iS}^{*}$ is larger than that of $\bar{y}_{iS}$ if $w_i < W_i$. Hence, the choice between
$\bar{y}_{iS}^{*}$ and $\bar{y}_{iS}$ in the case $w_i < W_i$ is not clear-cut and no simple recipe
seems to exist.

Drew et al. (1982) proposed another sample size dependent estimator which depends on
a parameter $K_0$. In the SRS case and the choice $K_0 = 1$, their estimator reduces to


$$\bar{y}_{iD} = \begin{cases} \bar{y}_i & \text{if } w_i \ge W_i \\ \bar{y}_{iS} & \text{if } w_i < W_i. \end{cases} \qquad (6.7)$$


As noted above, the choice between $\bar{y}_{iS}$ and $\bar{y}_{iS}^{*}$ in the case $w_i < W_i$ is
not clear-cut. Consequently, the choice between $\bar{y}_{iD}$ and $\bar{y}_{iS}^{**}$ is also not
clear-cut.
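
The three sample size dependent estimators can be compared side by side. The sketch below assumes the following reading of (6.4), (6.6) and (6.7): Särndal's estimator shrinks the domain mean toward the overall mean with factor w_i/W_i, the Hidiroglou-Särndal modification uses the squared factor when w_i < W_i and the direct mean otherwise, and the Drew et al. estimator (with its parameter set to 1) switches between the direct mean and Särndal's estimator. Function names and numbers are hypothetical.

```python
def sarndal(ybar_i, ybar, w_i, W_i):
    """Shrinkage toward the overall mean with factor w_i / W_i."""
    return ybar + (w_i / W_i) * (ybar_i - ybar)

def hidiroglou_sarndal(ybar_i, ybar, w_i, W_i):
    """Direct mean if w_i >= W_i, squared shrinkage factor otherwise."""
    if w_i >= W_i:
        return ybar_i
    return ybar + (w_i / W_i) ** 2 * (ybar_i - ybar)

def drew_et_al(ybar_i, ybar, w_i, W_i):
    """Direct mean if w_i >= W_i, Sarndal's estimator otherwise."""
    if w_i >= W_i:
        return ybar_i
    return sarndal(ybar_i, ybar, w_i, W_i)

# Hypothetical domain with mean 10, overall mean 6 and W_i = 0.2
under = [f(10.0, 6.0, 0.1, 0.2) for f in (sarndal, hidiroglou_sarndal, drew_et_al)]
over = [f(10.0, 6.0, 0.3, 0.2) for f in (sarndal, hidiroglou_sarndal, drew_et_al)]
```

When the domain is under-represented (w_i < W_i) the three estimators differ only in how strongly they shrink toward the overall mean; when it is over-represented, only Särndal's estimator moves away from the direct mean.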

If $N_i$ is unknown, the conditional argument may still be relevant provided $N_i$ is unrelated
to the parameter of interest $\bar{Y}_i$. It is also relevant when partial information on $N_i$ is
available, such as bounds on $N_i$.

If a concomitant variable x with known domain mean $\bar{X}_i$ is available, the ratio estimator


$$\bar{y}_{ir} = \frac{\bar{y}_i}{\bar{x}_i}\,\bar{X}_i \qquad (6.8)$$


and a regression-type estimator (Battese and Fuller 1981)


$$\bar{y}_{ir}^{*} = \bar{y}_i + \frac{\bar{y}}{\bar{x}}\,(\bar{X}_i - \bar{x}_i) \qquad (6.9)$$


are both conditionally unbiased (approximately), but $\bar{y}_{ir}^{*}$ is likely to be more
efficient if a regression model (through the origin) with a common slope holds true, at least
approximately, for the small areas. If the slopes are varying, then an empirical Bayes estimator,
which is more complex, might be more relevant (Dempster et al. 1981).


*Särndal’s estimator, however, should perform better in the case of a one-way model. The estimator
is obtained by pooling estimators of the form (6.4) over two or more groups.


6.2 Domain Total 


If $N_i$ is known, then an estimate of the domain total $Y_i = N_i\bar{Y}_i$ is simply obtained by
multiplying a chosen estimator of $\bar{Y}_i$ by $N_i$. On the other hand, the usual unbiased
estimator


$$\hat{Y}_i = \hat{N}_i\,\bar{y}_i = \frac{N}{n}\sum_{k \in s_i} y_k, \quad n_i \ge 1 \qquad (6.9)$$


is used if $N_i$ is unknown, where $\hat{N}_i = N w_i$ is the unbiased estimator of $N_i$ and
$P(n_i = 0)$ is assumed to be negligible.

Suppose now that we have prior information, say $N_i^{*} \le N_i \le N_i^{**}$. Then the
conditional argument may be relevant. The conditional bias of $\hat{Y}_i$ is


$$B_2(\hat{Y}_i) = (\hat{N}_i - N_i)\,\bar{Y}_i. \qquad (6.10)$$


It follows from (6.10) (assuming $\bar{Y}_i > 0$) that $B_2(\hat{Y}_i) > 0$, i.e., overestimation,
if $\hat{N}_i > N_i$ and that $B_2(\hat{Y}_i)$ increases as the domain sample size $n_i$ increases.
Similarly, $B_2(\hat{Y}_i) < 0$, i.e., underestimation, if $\hat{N}_i < N_i$ and
$|B_2(\hat{Y}_i)|$ increases as $n_i$ decreases; the conditional bias is zero if
$\hat{N}_i = N_i$.

Utilizing the prior information, we can modify $\hat{Y}_i$ as


$$\hat{Y}_i^{*} = \begin{cases} N_i^{*}\,\bar{y}_i & \text{if } \hat{N}_i < N_i^{*} \\ \hat{N}_i\,\bar{y}_i & \text{if } N_i^{*} \le \hat{N}_i \le N_i^{**} \\ N_i^{**}\,\bar{y}_i & \text{if } \hat{N}_i > N_i^{**}. \end{cases} \qquad (6.11)$$


The absolute conditional bias of $\hat{Y}_i^{*}$ is smaller than that of $\hat{Y}_i$ if either
$\hat{N}_i < N_i^{*}$ or $\hat{N}_i > N_i^{**}$, while $\hat{Y}_i^{*} = \hat{Y}_i$ in the interval
$N_i^{*} \le \hat{N}_i \le N_i^{**}$. Hence, $\hat{Y}_i^{*}$ is conditionally better than the
unbiased estimator $\hat{Y}_i$. Also the unconditional MSE of $\hat{Y}_i^{*}$ is smaller than that
of $\hat{Y}_i$, although $\hat{Y}_i^{*}$ is unconditionally biased. Unfortunately, there is no
simple way to improve upon $\hat{Y}_i^{*}$ in the range $N_i^{*} \le \hat{N}_i \le N_i^{**}$. In
any case, $\hat{Y}_i^{*}$ should be preferred over $\hat{Y}_i$. Good supplementary information on
the domain size is necessary in estimating a domain total efficiently.
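
The modification based on prior bounds amounts to truncating the estimated domain size to the known interval before multiplying by the domain sample mean. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def domain_total_with_bounds(N, n, y_domain, lower, upper):
    """Return (modified total, usual total), where the modified total uses the
    estimated domain size N * n_i / n truncated to [lower, upper]."""
    n_i = len(y_domain)
    if n_i < 1:
        raise ValueError("the domain sample must contain at least one unit")
    ybar_i = sum(y_domain) / n_i
    N_hat = N * n_i / n
    N_used = min(max(N_hat, lower), upper)
    return N_used * ybar_i, N_hat * ybar_i

# Hypothetical case: N = 1000, n = 100, two sampled units fall in the domain
modified, usual = domain_total_with_bounds(1000, 100, [2.0, 4.0], lower=30, upper=50)
inside, usual_inside = domain_total_with_bounds(1000, 100, [2.0, 4.0], lower=10, upper=50)
```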


7. GENERAL DESIGNS 


Post-stratification adjustment is commonly employed in complex large-scale surveys, mainly 
to increase the efficiency of estimators, e.g., the age-sex adjustment in the Canadian Labour 
Force Survey (LFS). A general theory of unconditional inference is also available. 
The estimator of total Y is given by


$$\hat{Y}_{pst} = \sum_i \frac{\hat{Y}_i}{\hat{M}_i}\, M_i \qquad (7.1)$$


where $\hat{Y}_i$ and $\hat{M}_i$ are the usual unbiased domain estimators of the i-th post-stratum
total $Y_i$ and size $M_i$ respectively. In the LFS, projected census counts are used for the
$M_i$.

The estimator $\hat{Y}_{pst}$ reduces to $\sum N_i\bar{y}_i$ in the SRS case (see (3.2)) and we
have already seen that $\sum N_i\bar{y}_i$ is conditionally unbiased in the SRS case (assuming all
$n_i \ge 1$). However, for complex designs it seems difficult to investigate the conditional
properties of (7.1); even the choice of reference set is not so clear-cut. To illustrate this
difficulty, consider stratified SRS with L = 2 strata and k = 2 post-strata. If we condition on
the observed post-strata sample sizes $(n_{h1}, n_{h2})$ in each stratum h, the theory is
straightforward provided the post-strata sizes $N_{hi}$ in each stratum are known. However, in
practice we will run into problems with zero sample sizes $n_{hi}$ and also the sizes $N_{hi}$ in
each stratum may not be available or the projections inaccurate, although
$N_{\cdot i} = \sum_h N_{hi} = M_i$ are available. Hence, we may prefer to condition on the
observed total sample sizes $(n_{\cdot 1}, n_{\cdot 2})$, where $n_{\cdot i} = \sum_h n_{hi}$.

The estimator $\hat{Y}_{pst}$ in this special case of stratified SRS (L = 2, k = 2) reduces to


$$\hat{Y}_{pst} = N_{\cdot 1}\,\frac{\frac{N_{1\cdot}}{n_{1\cdot}}\, y_{11} + \frac{N_{2\cdot}}{n_{2\cdot}}\, y_{21}}{\frac{N_{1\cdot}}{n_{1\cdot}}\, n_{11} + \frac{N_{2\cdot}}{n_{2\cdot}}\, n_{21}} + N_{\cdot 2}\,\frac{\frac{N_{1\cdot}}{n_{1\cdot}}\, y_{12} + \frac{N_{2\cdot}}{n_{2\cdot}}\, y_{22}}{\frac{N_{1\cdot}}{n_{1\cdot}}\, n_{12} + \frac{N_{2\cdot}}{n_{2\cdot}}\, n_{22}} \qquad (7.2)$$


where $N_{h\cdot} = N_{h1} + N_{h2}$ and $n_{h\cdot} = n_{h1} + n_{h2}$ are the strata population
and sample sizes respectively, and $y_{hi}$ are the sample totals in the $(h, i)$-th cell. The
conditional expectation of (7.2) given $(n_{\cdot 1}, n_{\cdot 2})$ is not tractable since one has
to evaluate the sum


$$E_2(\hat{Y}_{pst}) = \sum_{s} p(s \mid n_{\cdot 1}, n_{\cdot 2})\,\hat{Y}_{pst}(s), \qquad (7.3)$$


where s is a possible sample such that the observed sample sizes $n_{hi}$ satisfy
$n_{1i} + n_{2i} = n_{\cdot i}$ (i = 1, 2), $\hat{Y}_{pst}(s)$ is the value of (7.2) for the
sample s, and $p(s \mid n_{\cdot 1}, n_{\cdot 2})$ is the conditional probability of observing s
given $(n_{\cdot 1}, n_{\cdot 2})$:


$$p(s \mid n_{\cdot 1}, n_{\cdot 2}) = \left[\sum_{n_{11} \ge 0} \binom{N_{11}}{n_{11}} \binom{N_{12}}{n_{1\cdot} - n_{11}} \binom{N_{21}}{n_{\cdot 1} - n_{11}} \binom{N_{22}}{n_{2\cdot} - n_{\cdot 1} + n_{11}}\right]^{-1}. \qquad (7.4)$$


It is clear from (7.3) and (7.4), however, that $E_2(\hat{Y}_{pst}) \neq Y$ since
$\hat{Y}_{pst}$ does not depend on the cell totals $N_{hi}$ unlike
$p(s \mid n_{\cdot 1}, n_{\cdot 2})$.

Turning to variance estimation, the usual formula for general designs is given by


$$v^{*}(\hat{Y}_{pst}) = v(z_t^{*}) \qquad (7.5)$$




where $v(y_t) = v(\hat{Y})$ is the usual variance estimator of the estimated total $\hat{Y}$, and
$v(z_t^{*})$ is obtained from $v(\hat{Y})$ by replacing $y_t$ by


$$z_t^{*} = y_t - \sum_i \frac{\hat{Y}_i}{\hat{M}_i}\, a_t(i), \qquad (7.6)$$


where $a_t(i) = 1$ if the t-th element belongs to the i-th post-stratum and $a_t(i) = 0$ otherwise
(Williams 1962). In the SRS case, (7.5) reduces to


$$v^{*}(\hat{Y}_{pst}) = N^2\Big(\frac{1}{n} - \frac{1}{N}\Big)\frac{1}{n}\sum_i n_i s_i^2, \qquad (7.7)$$


(assuming $(n_i - 1)/(n - 1) \doteq n_i/n$) which is not equal to (3.3) when multiplied by $N^2$.
Hence, (7.5) does not behave well in the conditional framework, even in the SRS case. On
the other hand, a new variance estimator


$$v(\hat{Y}_{pst}) = v(z_t), \qquad (7.8)$$

where

$$z_t = \sum_i \frac{M_i}{\hat{M}_i}\Big(y_t(i) - \frac{\hat{Y}_i}{\hat{M}_i}\, a_t(i)\Big) \qquad (7.9)$$


and $y_t(i) = y_t$ if the t-th element belongs to the i-th post-stratum and $y_t(i) = 0$ otherwise,
might be preferable over $v^{*}(\hat{Y}_{pst})$ since in the SRS case it reduces to (3.3) when
multiplied by $N^2$ and the finite population correction is ignored:


$$v(\hat{Y}_{pst}) \doteq N^2 \sum_i W_i^2\,\frac{s_i^2}{n_i}. \qquad (7.10)$$


Some theory for ratio estimators under models also suggests that $v(\hat{Y}_{pst})$ might perform
better conditionally than $v^{*}(\hat{Y}_{pst})$. In any case, there is no harm in switching to
(7.8) since it is asymptotically equivalent to the customary variance estimator (7.5),
unconditionally.
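
The difference between the two variance estimators is easy to see in the SRS case. The sketch below assumes the linearized variables take the form z*_t = y_t minus the post-stratum sample mean for the customary estimator, and the same residual multiplied by W_i / w_i for the new one; under SRS both are then plugged into the usual formula N^2 (1/n - 1/N) s^2. Function names and data are hypothetical.

```python
def srs_total_variance(values, N):
    """Usual SRS variance estimator of an estimated total: N^2 (1/n - 1/N) s^2."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return N ** 2 * (1.0 / n - 1.0 / N) * s2

def poststratified_variances(y, labels, W, N):
    """Return (customary, new) variance estimates for the post-stratified total.
    W maps each post-stratum label to its known population share W_i."""
    n = len(y)
    groups = {}
    for value, label in zip(y, labels):
        groups.setdefault(label, []).append(value)
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    # W_i / w_i with w_i = n_i / n, i.e. M_i over its SRS estimate
    ratio = {label: W[label] / (len(vals) / n) for label, vals in groups.items()}
    z_star = [value - means[label] for value, label in zip(y, labels)]
    z_new = [ratio[label] * (value - means[label]) for value, label in zip(y, labels)]
    return srs_total_variance(z_star, N), srs_total_variance(z_new, N)

# Hypothetical sample of five units in two post-strata with W = (0.5, 0.5)
v_star, v_new = poststratified_variances(
    [1.0, 3.0, 10.0, 12.0, 14.0], ["a", "a", "b", "b", "b"], {"a": 0.5, "b": 0.5}, N=100
)
```

In this example the more variable post-stratum is over-represented in the sample, so the customary estimator weights its within-stratum variance by n_i/n while the new one reweights it by W_i, giving a smaller value.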


8. DISCUSSION 


Our study clearly shows that conditional inference for complex designs involves formidable 
difficulties. Nevertheless, we should not use conventional procedures blindly. In those cases 
where conditional inference is feasible, as in the SRS case, we should certainly employ con- 
ditionally relevant methods as elaborated in Sections 2 - 6, while in the more complex cases 
we should at least make simple modifications to conventional methods, as in (7.8), so that 
they agree with known, conditionally correct results in special cases. Clearly, we need more 
research in this area. 
Cost-Variance Optimization for the Canadian
Labour Force Survey 


G.H. CHOUDHRY, H. LEE, and J.D. DREW¹


ABSTRACT 


The cost-variance optimization of the design of the Canadian Labour Force Survey was carried out 
in two steps. First, the sample designs were optimized for each of the two major area types, the Self- 
Representing (SR) and the Non-Self-Representing (NSR) areas. Cost models were developed and 
parameters estimated from a detailed field study and by simulation, while variances were estimated us- 
ing data from the Census of Population. The scope of the optimization included the allocation of sam- 
ple to the two stages in the SR design, and the consideration of two alternatives to the old design in 
NSR areas. The second stage of optimization was the allocation of sample to SR and NSR areas. 


KEY WORDS: Multi-stage designs; Sample allocation; Linear cost function; Components of variance. 


1. INTRODUCTION 


The Canadian Labour Force Survey (LFS) is a monthly household survey conducted by 
Statistics Canada to produce estimates for various labour force characteristics. It follows a 
stratified multi-stage rotating sample design with six rotation groups. Since its inception in 
1945, the survey has undergone a sample redesign following each decennial census of popula- 
tion. These redesigns serve to update the sample to reflect population changes. They also 
provide the opportunity to introduce improved sampling and estimation methodologies, and 
to respond to shifts in information needs to be satisfied by the survey. 

The 1981 postcensal redesign effort included a research phase as outlined in an earlier
paper (Singh and Drew 1981) in which all aspects of the survey design were examined in an 
effort to improve the cost efficiency of the survey vehicle. Highlights of the research program 
were presented by Singh, Drew, and Choudhry (1984). This report deals with the research aimed 
at cost-variance optimization of the sample design. 

The two important factors in the choice of a sample design are the total cost and the reliabili- 
ty of the resulting estimates. The optimum solution can be obtained by minimizing either 
total cost or total variance when the other is fixed. Equivalently, the approach we have followed 
is one of minimizing the product of variance and cost for fixed sample size. 

The cost-variance optimization was carried out in two steps. We first consider the optimiza- 
tion of the sample designs followed in each of the two major area types identified in the LFS 
design; i.e., the SR Areas or major cities, and NSR Areas which are the smaller urban and 
rural areas. The scope of the optimization includes the allocation of sample to the two stages 
of the SR design (Section 2), and the consideration of alternatives to the old design in NSR 
areas (Section 3). For NSR areas the old design is first evaluated empirically via a components 
of variance approach, and one stage of sampling in rural areas is identified for elimination. 
Subsequently the modified old design is compared to an alternative design featuring explicit 
rural/urban stratification from an overall cost-variance perspective. For both types of areas 
variances are obtained empirically using data from the 1971 and 1976 Censuses, while cost 
models are developed using data from a time and cost study, and by means of a simulation 
study. 


¹ G.H. Choudhry, H. Lee, and J.D. Drew, Census and Household Survey Methods Division, Statistics Canada,
4th Floor, Jean Talon Building, Tunney’s Pasture, Ottawa, Ontario K1A 0T6.
In Section 4, we consider the second stage of optimization, the allocation of sample to
NSR and SR areas, taking into account the design improvements identified for each type
of area. Finally, Section 5 summarizes the improvements identified, and their implications
for the redesigned sample.


2. SR DESIGN 


The old SR design is a stratified two-stage design (Platek and Singh 1976). Each
Self-Representing Unit (SRU) is stratified into a number of contiguous strata called subunits and
each subunit is subdivided into clusters which are the primary sampling units (PSU’s). The
PSU’s are selected using the random group method due to Rao, Hartley, and Cochran (1962)
and at the second stage of sampling, a systematic sample of dwellings is taken in such a manner
that the design becomes self-weighting. Let 1/W be the sampling rate in the stratum and
n be the number of PSU’s to be selected from the stratum. The N PSU’s in the stratum are
randomly partitioned into n groups so that the i-th random group contains $N_i$ PSU’s and
$\sum_{i=1}^{n} N_i = N$. Let $x_j$ and $M_j$, j = 1, 2, ..., N, respectively be the size measure
and dwelling count for the j-th PSU in the stratum.


Define


$$\lambda_j = \frac{x_j}{\sum_{j=1}^{N} x_j}$$


and

$$\delta_{ij} = \begin{cases} 1 & \text{if the j-th PSU is in the i-th group} \\ 0 & \text{otherwise.} \end{cases}$$


Then $\pi_i = \sum_{j=1}^{N} \delta_{ij}\lambda_j$ is the relative size of the i-th group. Now define
the $W_{ij}$’s as


$$W_{ij} = \delta_{ij}\Big[\frac{W\lambda_j}{\pi_i}\Big] \quad \text{or} \quad \delta_{ij}\Big(\Big[\frac{W\lambda_j}{\pi_i}\Big] + 1\Big) \qquad (2.1)$$

such that $\sum_{j=1}^{N} W_{ij} = W$ for i = 1, 2, ..., n, where [a] is the greatest integer less
than or equal to a. Now select one PSU from each of the n random groups independently with
probability proportional to the $W_{ij}$’s and sub-sample the selected PSU j from the i-th group
at the rate $1/W_{ij}$. Then the overall sampling rate within each of the random groups is 1/W so
that the design becomes self-weighting with a design weight equal to W. The average sample size
for the stratum is given by


$$m = M_0/W \qquad (2.2)$$


where $M_0$ is the total number of dwellings in the stratum. Let $M_{ij}$ be the number of
dwellings in the selected PSU j in the i-th group, then $m_i = M_{ij}/W_{ij}$ dwellings will be
selected from the i-th group. The average number of dwellings selected from the i-th group for a
given random grouping is $(1/W)\sum_j \delta_{ij} M_j$ and the average over all possible random
groupings is $m N_i/N$ since the expected value of $\delta_{ij}$ is $N_i/N$. If $N_i/N = 1/n$,
i.e., the number of PSU’s in each of the random groups is the same, then the average sample per
selected PSU is m/n = d (say), where d will be called the average density for the stratum. Since
m is fixed, the sample of m dwellings can be selected by varying n and d such that the product
(nd) remains equal to m, the total sample size for the stratum. Our objective here is to obtain
d which for a fixed sample size minimizes the product of variance and cost. For the optimization
we obtain the total variance via the components of variance approach and consider a linear cost
function as described in the following section.
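
The integer inverse sampling rates within one random group can be illustrated as follows. The text only requires each rate to be the integer part of W times the PSU's share of the group size measure, or that integer plus one, with the rates summing to W; the largest-remainder rule used here to decide which PSU's get the extra unit is an assumption of this sketch, as are the function name and the numbers.

```python
from math import floor

def integer_inverse_rates(lams, W):
    """lams: relative sizes of the PSUs in one random group; W: integer design weight.
    Returns integer rates, each the floor of W * lam / group size or that floor
    plus one, summing to W."""
    pi_i = sum(lams)
    exact = [W * lam / pi_i for lam in lams]
    rates = [floor(x) for x in exact]
    shortfall = W - sum(rates)
    # assumption: the extra units go to the PSUs with the largest fractional parts
    order = sorted(range(len(lams)), key=lambda j: exact[j] - rates[j], reverse=True)
    for j in order[:shortfall]:
        rates[j] += 1
    return rates

# Hypothetical group of three PSUs and design weight W = 20
lams = [0.05, 0.03, 0.03]
W = 20
rates = integer_inverse_rates(lams, W)
# Selecting a PSU with probability rate / W and sub-sampling at 1 / rate
# gives the same overall rate 1 / W for every PSU in the group.
overall = [(r / W) * (1.0 / r) for r in rates]
```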


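2.1 Variance Function


Suppose that we are interested in the total of a characteristic y for the subunit. Let $y_{hj}$
be the y-value for the h-th household in PSU j, where h = 1, 2, ..., $M_j$; then the total
$Y = \sum_{j=1}^{N}\sum_{h=1}^{M_j} y_{hj}$ is estimated by


$$\hat{Y} = W\sum_{i=1}^{n} y_i \qquad (2.3)$$


where $y_i$ is the sum of the y-values for the $m_i$ selected households from the PSU selected
from the i-th group, i = 1, 2, ..., n. Ignoring the effect due to rounding involved in defining
$W_{ij}$, the variance of $\hat{Y}$ is given by (Rao et al. 1962)


$$\mathrm{Var}(\hat{Y}) = A\Big[\sum_{j=1}^{N}\frac{Y_j^2}{\lambda_j} - Y^2\Big] + \sum_{j=1}^{N} M_j S_j^2\Big\{W - 1 - A\Big(\frac{1}{\lambda_j} - 1\Big)\Big\} \qquad (2.4)$$


where

$$Y_j = \sum_{h=1}^{M_j} y_{hj}, \qquad S_j^2 = \frac{1}{M_j - 1}\sum_{h=1}^{M_j}\Big(y_{hj} - \frac{Y_j}{M_j}\Big)^2, \qquad A = \frac{\sum_{i=1}^{n} N_i^2 - N}{N(N - 1)}.$$


If $N_i = N/n$, i.e., all random groups have an equal number of PSU’s, then


$$A = \frac{N - n}{n(N - 1)}.$$


The relative variance of $\hat{Y}$, defined by $\mathrm{Var}(\hat{Y})/Y^2$, will be


$$\text{Rel. Var}(\hat{Y}) = A\mu_1 + (W - 1)\mu_2 + A\mu_2 - A\mu_3 = (W - 1)\mu_2 + A(\mu_1 + \mu_2 - \mu_3) \qquad (2.5)$$


where

$$\mu_1 = \frac{1}{Y^2}\sum_{j=1}^{N}\frac{Y_j^2}{\lambda_j} - 1, \qquad \mu_2 = \frac{1}{Y^2}\sum_{j=1}^{N} M_j S_j^2, \qquad \mu_3 = \frac{1}{Y^2}\sum_{j=1}^{N}\frac{M_j S_j^2}{\lambda_j}.$$


$\mu_1$, $\mu_2$, and $\mu_3$ are population parameters and are fixed for a particular
characteristic. Since m = nd and if we assume that $N_i = N/n$ then we can write A as


$$A = \frac{1}{N - 1}\Big(\frac{Nd}{m} - 1\Big)$$


and

$$\text{Rel. Var}(\hat{Y}) = (W - 1)\mu_2 + \frac{1}{N - 1}\Big(\frac{Nd}{m} - 1\Big)(\mu_1 + \mu_2 - \mu_3) = a_0 + a_1 d \qquad (2.6)$$


where

$$a_0 = (W - 1)\mu_2 - \frac{\mu_1 + \mu_2 - \mu_3}{N - 1}, \qquad a_1 = \frac{N(\mu_1 + \mu_2 - \mu_3)}{m(N - 1)}.$$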

From (2.6), we observe that, from the reliability point of view, the value d = 1 (i.e., one
dwelling per PSU) is optimum. But this will have an impact on the cost, as discussed in the next
section. The values of $a_0$ and $a_1$ for unemployed for the Halifax SRU were obtained from 1981
census data and these are


$a_0$ = 0.019005, $a_1$ = 0.0007972.


Since $a_1$ is very small compared to $a_0$, the increase in the variance with the corresponding
increase in d will be very small. Next we examine the effect on the cost due to varying
the value of the average density d.
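
The linear dependence of the relative variance on the density can be evaluated directly from the two Halifax coefficients quoted above. The sketch also checks, for hypothetical values of N, n, m and d, that the random-group constant written in terms of n agrees with its form in terms of d when n = m/d; the function names are illustrative only.

```python
def rhc_constant(N, n):
    """Random-group constant for equal group sizes: (N - n) / (n (N - 1))."""
    return (N - n) / (n * (N - 1))

def rhc_constant_from_density(N, m, d):
    """The same constant with n replaced by m / d."""
    return (N * d / m - 1) / (N - 1)

def relative_variance(d, a0=0.019005, a1=0.0007972):
    """Relative variance a0 + a1 * d, with the quoted Halifax coefficients as defaults."""
    return a0 + a1 * d

relvars = {d: relative_variance(d) for d in range(2, 11)}
```

Over the range d = 2 to 10 the relative variance rises only from about 0.021 to about 0.027.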


2.2 Cost Model 


A simple cost model has been considered to investigate the impact on the cost as the den- 
sity is varied. Due to telephone interviewing in the SR areas, personal visits are only required 
to a PSU during the rotation month and in cases where some households were without a 
telephone or did not agree to telephone interviewing. 

A breakdown of the interviewing cost by telephone and personal visit is available for in- 
dividual interviewers from field operations, but further breakdown of the personal visit com- 
ponent of the cost was required to construct the cost model. For this purpose a special time 
and cost study was carried out in the field for a period of six months (February-July 1982) 
on a random sample of interviewers. The results from the analysis of time and cost data 
are documented in a report by Lemaitre (1983). For the purpose of our cost model, we define 
the following set of parameters 


$c_0$ = Fixed costs

$c_1$ = Average cost of dwelling-to-dwelling travel within the same PSU

$c_2$ = Average cost of PSU-to-PSU travel

$\gamma$ = Number of PSU-to-PSU moves per selected PSU.
The fixed cost $c_0$ includes the time spent actually conducting interviews, whether by
telephone or in person, and the travel cost from home to area and back. The fixed cost $c_0$
depends only on the total sample size m and not on n, the number of selected PSU’s. Suppose
that there are $g_1$ dwelling-to-dwelling moves and $g_2$ PSU-to-PSU moves made, then the
total cost for m dwellings will be


$$T = c_0 + g_1 c_1 + g_2 c_2. \qquad (2.7)$$

If n is increased then $g_2$ will also increase and $g_1$ will decrease, and vice versa, but
$(g_1 + g_2)$ should remain constant because the number of moves depends on the sample size m and
the proportion of households interviewed by personal visit. Then we may write

$$g_1 + g_2 = \theta m. \qquad (2.8)$$

From (2.8) we substitute $g_1$ in equation (2.7) and obtain


$$T = c_0 + \theta m c_1 + g_2(c_2 - c_1) = c_0 + \theta m c_1 + n\gamma(c_2 - c_1).$$


Now replacing n by m/d we have


$$T = c_0 + \theta m c_1 + \frac{m}{d}\,\gamma\,(c_2 - c_1)$$


and cost per dwelling C as a function of average density d is given by


$$C = \frac{c_0}{m} + \theta c_1 + \frac{\gamma}{d}(c_2 - c_1). \qquad (2.9)$$
m d 


From the Time and Cost Study the parameters $c_1$ and $c_2$ for Halifax were 0.78 and 2.51
respectively. These parameters were observed with average density equal to 5, but $c_2$ increases
with d and $c_1$ decreases with d. Assuming that the average distance between the units is
inversely proportional to the square root of the number of units in an area, we can replace
$c_1$ by $c_1(5/d)^{1/2}$ and $c_2$ by $c_2(d/5)^{1/2}$ in our model so that the modified model
becomes


$$C = \frac{c_0}{m} + \theta c_1\Big(\frac{5}{d}\Big)^{1/2} + \frac{\gamma}{d}\Big[c_2\Big(\frac{d}{5}\Big)^{1/2} - c_1\Big(\frac{5}{d}\Big)^{1/2}\Big]. \qquad (2.10)$$


$c_0/m$ is the fixed per dwelling cost and does not depend on density; its value was 3.28 from the
Time and Cost Study. The parameter $\theta$ does not depend on the density either and was equal
to 0.356 from the Time and Cost Study. The parameter $\gamma$ increases with density because the
average number of visits to a PSU will increase due to higher density. We have approximated
$\gamma$ by


$$\gamma = \frac{1}{6}\big[1 + 5(1 - p^d)\big]$$


where p is the probability of telephone interview for a household in a non rotate-in PSU
and the value of p was 0.85 as obtained from interviewers’ data. From the cost model (2.10),
the values of per dwelling cost for d = 2, 3, ..., 10 are given in Table 1 along with the
relative variances and the products of these two, which are the values of the objective
function to be minimized.
Table 1


Value of Relative Variance, Cost per Dwelling and
Objective Function for Various Densities (Unemployed)


Density    Relative Variance    Cost per Dwelling    Objective Function

   2            0.0206                3.79                 0.078
   3            0.0214                3.79                 0.081
   4            0.0222                3.79                 0.084
   5            0.0230                3.78                 0.087
   6            0.0238                3.77                 0.090
   7            0.0246                3.76                 0.092
   8            0.0254                3.75                 0.095
   9            0.0262                3.74                 0.098
  10            0.0270                3.73                 0.101


As expected, we observe that under the model considered here, the cost per dwelling
decreases very slowly as the density increases since the fixed per dwelling cost ($c_0/m$)
dominates in (2.10) due to telephone interviewing. From the previous section we had found
that the increase in the relative variance is very small as the density increases. As a result
our objective function is monotonically increasing, but the loss in cost-variance efficiency
with an increase in d is small. However, it was decided to retain the old density of 5 for the
redesigned sample on the grounds that a lower density would have resulted in more selected
PSU’s with higher implementation and maintenance costs.
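
The cost model and the objective function can be evaluated with the Halifax parameter values quoted in the text. The sketch below assumes the reading in which the number of PSU-to-PSU moves per selected PSU is [1 + 5(1 - p^d)]/6 and the two travel costs are rescaled by the square-root factors relative to density 5; under these assumptions the computed costs agree with Table 1 to within rounding. The function names are illustrative.

```python
from math import sqrt

def gamma_moves(d, p=0.85):
    """Assumed approximation to the PSU-to-PSU moves per selected PSU."""
    return (1 + 5 * (1 - p ** d)) / 6

def cost_per_dwelling(d, fixed=3.28, theta=0.356, c1=0.78, c2=2.51, p=0.85, d_ref=5):
    """Per dwelling cost with travel costs rescaled from the reference density d_ref."""
    c1_d = c1 * sqrt(d_ref / d)  # dwelling-to-dwelling cost falls as density rises
    c2_d = c2 * sqrt(d / d_ref)  # PSU-to-PSU cost rises as density rises
    return fixed + theta * c1_d + gamma_moves(d, p) / d * (c2_d - c1_d)

def objective(d, a0=0.019005, a1=0.0007972):
    """Product of relative variance and per dwelling cost."""
    return (a0 + a1 * d) * cost_per_dwelling(d)

table = [(d, a := 0.019005 + 0.0007972 * d, cost_per_dwelling(d), objective(d))
         for d in range(2, 11)]
```

Each row of `table` holds the density, relative variance, cost per dwelling and objective value, which can be printed for comparison with Table 1.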


3. NSR DESIGN 


3.1 NSR Design Alternatives 


Design Alternative $D_0$: Old NSR Design (see Figure 1)
Key features of the old NSR design (Platek and Singh 1976) were:


i) Stratification: Economic Regions (ER’s), whose numbers varied from 1-10 per province,
served as major strata. Within ER’s, from 1-5 geographically contiguous strata were
formed, using industry data from the 1971 Census.


ii) Primary Sampling Units (PSU’s): These were delineated within strata to be
geographically compact areas similar to the stratum with respect to stratification
variables, and with respect to the ratio of rural to urban population. PSU populations
ranged from 3,000 to 5,000. In the first stage PSU’s were selected following the
randomized probability proportional to size systematic (RPPSS) method of Hartley and
Rao (1962). Within PSU’s urban and rural parts were sampled separately.


iii) Within PSU Sampling: Urbans. All urban centers assigned in whole or in part to selected
PSU’s were included in the sample. The second stage of sampling was a sample of blocks,
following the RPPSS method. The third and final stage of sampling was a systematic
sample of dwellings.
iv) Within PSU Sampling: Rurals. The second stage of sampling was an RPPSS sample
of EA’s. EA’s were then field counted for the purposes of delineating clusters having
from 3-20 dwellings. The third and fourth stages of sampling corresponded to an RPPSS
sample of clusters and a systematic sample of dwellings.


Figure 1. Representation of NSR Design Alternatives (solid line: stratification; dashed line:
stage of sampling). Panels: $D_0$: Old NSR Design; $D_1$: Elimination of Cluster Stage in
Rurals; $D_2$: Explicit Rural/Urban Stratification. Levels shown: ER, strata, PSU’s, urban and
rural parts, urban centers and EA’s, clusters, dwellings. [Diagram not reproduced.]


Design Alternative $D_1$: Elimination of Cluster Stage of Sampling in Rurals


i) It would permit shortening of the lead time to select independent samples from the LFS
frame to 7 months from 13 months, by eliminating the need for counting of EA’s.


ii) Elimination of the clustering step would reduce sample maintenance costs.


iii) A priori, the reduction in the stages of sampling from 4 to 3 would translate
into a reduced variance. It was expected that costs, on the other hand, would not be
very much affected, particularly with the shift to telephone interviewing.


At an early juncture in the redesign research program a field study was carried out on
the operational implications of eliminating the cluster stage. Verification of EA listings
a year later revealed no problems with the quality of listings, and analysis revealed no
discernible impact on data collection costs.
Design Alternative $D_2$: Explicit Urban/Rural Stratification


The old design, with its separate sampling of urban and rural portions of PSU’s, featured
an implicit urban/rural stratification. A drawback of the approach, however, was that
maintaining the stratum urban to rural population ratio at the PSU level required frequent
discontiguity between rural and urban portions of PSU’s, leading in turn to increased
travelling costs.

In view of this problem with the old design, design alternative $D_2$ was formulated as
follows:


i) Stratification: Rural and urban portions of ER’s would constitute primary strata, which
would be optimally sub-stratified to the point of having strata yields of 100-150 dwellings
(i.e., 2-3 PSU’s each corresponding to an interviewer’s assignment). ER’s not able
to support at least one such urban and one such rural stratum (roughly 1/3 of ER’s)
were considered ineligible for $D_2$.


Secondary rural strata would be contiguous, while secondary urban strata would be 
formed without geographic constraints. 


ii) Sampling Within Rural Strata: PSU’s similar to the stratum with respect to stratification
variables would be formed by grouping geographically contiguous EA’s and would
be selected by the RPPSS method. Second and third stages of sampling would be an
RPPSS sample of EA’s and a systematic sample of dwellings.


iii) Sampling Within Urban Strata: Sampling would proceed in three stages as follows: 
RPPSS sample of PSU’s (individual or combined urban centers), RPPSS sample of 
clusters, and systematic sample of dwellings. 


3.2 Variance Components Model 


Design alternatives $D_0$, $D_1$ and $D_2$ were simulated using census data. Expressions for the
variance components are given below:


Stage of Sampling Variance Expression 
Ist Vie ane (3.1) 
N VRPPSS 
2nd Va = WY 7 (3.2) 
rae 
3rd Vay = WY Y —o" if last stage, 
eye We 
(3i3) 
VRPPSS 
= WY Y—™ otherwise 
im a ij 
Vas 
4th Vy =WEDy—™ (3.4) 
(where applicable) Pp ke Wig 


The variance formula and its computation method for the RPPSS sampling are described 
in Appendix A. 
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3.3. Cost Model 


Whereas the cost model for the SR areas dealt with allocation of samples to 2 stages of
sampling, here a cost model is needed to compare alternative NSR designs.
The cost model for design $D_1$ under personal interviewing was formulated as


$$C_{D_1} = F_0 + F_1 + F_2 + E_1 + E_2,$$


where $F_0$ = fixed fee for interviewing,

$F_1$ = fee for home to area, between PSU, and between secondary travel,

$F_2$ = fee for within secondary (dwelling to dwelling) travel,

$E_1$ = expenses associated with home to area, between PSU, and between secondary travel,

$E_2$ = expenses associated with dwelling to dwelling travel.


Fees are compensation for the time spent and expenses for the distance covered. All
parameters are expressed in terms of per dwelling costs.
Under telephone interviewing, this was modified to


$$C_{D_1} = F_0 + \alpha (F_1 + F_2 + E_1 + E_2),$$


where $\alpha$ is the factor by which time and mileage would be decreased under telephoning.


Now, under the assumption that $D_2$ would affect $F_1$ and $E_1$, say by a factor $r$, but would
not affect the other components, we have


$$C_{D_2} = F_0 + \alpha r (F_1 + E_1) + \alpha (F_2 + E_2).$$


Parameters of $C_{D_1}$ and $C_{D_2}$ were estimated as follows:


$F_0$, $F_1$, $F_2$, $E_1$, $E_2$: These were estimated under $D_0$ from a special Time and Cost study
(Lemaitre 1983), carried out as part of the redesign research program.
Since the field test of $D_1$ revealed no discernible differences in data
collection costs between $D_0$ and $D_1$, these parameters were assumed
unchanged under $D_1$.


$\alpha$: Field testing of telephone interviewing carried out as part of the redesign
research program did not have as an objective the estimation of cost savings.
Regional Operations staff estimated a 10% reduction in total data collection
costs, which permitted calculation of $\alpha$.


$r$: This parameter could not be estimated from available data; rather,
a Monte Carlo simulation study was needed, which is described in
Appendix B.
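
As a rough check on the cost model, the sketch below evaluates the per dwelling cost with and without the reduction factor r, using the parameter values printed in the ER 41 row of Table 3. The function and variable names are illustrative only, and treating personal interviewing as the special case alpha = 1, r = 1 is an assumption made here for compactness.

```python
def cost_per_dwelling(f0, f1, f2, e1, e2, alpha=1.0, r=1.0):
    """Per dwelling data collection cost under the NSR cost model.

    alpha = 1, r = 1 : personal interviewing, no travel reduction
    alpha < 1        : telephone interviewing reduces time and mileage
    r < 1            : reduction in home-to-area, between-PSU and
                       between-secondary travel (fee F1 and expense E1)
    """
    between_travel = f1 + e1   # affected by both alpha and r
    within_travel = f2 + e2    # affected by alpha only
    return f0 + alpha * r * between_travel + alpha * within_travel


# Parameter values as printed for ER 41 in Table 3.
er41 = dict(f0=2.04, f1=0.94, f2=0.94, e1=0.96, e2=0.69)

cost_base = cost_per_dwelling(**er41, alpha=0.84)                 # no reduction factor
cost_stratified = cost_per_dwelling(**er41, alpha=0.84, r=0.42)   # with r applied

print(round(cost_base, 2), round(cost_stratified, 2),
      round(cost_base / cost_stratified, 2))
# prints: 5.01 4.08 1.23
```

These three figures agree with the last three columns of the ER 41 row of Table 3.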


3.4 Results of Cost-Variance Analyses 
Variance Analysis: $D_1$ vs. $D_0$


Components of variance for 6 labour force characteristics were obtained for designs $D_0$
and $D_1$ using 1971 Census data for 5 ER’s across Canada. Table 2 gives the percent contribution
from each stage of sampling to the total variance under $D_0$. It can be observed that
30-40% of the total variance under $D_0$ was due to the rural cluster (3rd) stage of sampling,
and that under design $D_1$ variance reductions of 20-30% could be obtained.
Table 2


Percent Contributions to the Total Variance from Stages of Sampling
for the Current Design and Percent Reduction in the Total Variance Due to
Eliminating Cluster Stage of Sampling in Rural Areas, $100(1 - V_{D_1}/V_{D_0})$


                     Percent Contribution to Total Variance from
                     1st     Urban           Rural                    Percent Variance Reduction,
Characteristic       stage   2nd     3rd     2nd     3rd     4th      $100(1 - V_{D_1}/V_{D_0})$
                             stage   stage   stage   stage   stage

LF Population | Sete VAC wi ORs 358 6 4U, 50 Ls) 30.5 
Employed 21.2 ll 25 el OL4 6:3 su3),0dy 15.8 21.1 
Unemployed 12.6 15.8 16.6 4.8 33.0 17.2 24.8
Not in LF 24.7. Poll Ie pO a, Cr ages 8? ie SPH 22.9 
Employed Agr. 42.4 LO fy 0.8 gel 23307 R30. Sh 2I6 20.4 
Employed Non-Agr. 23,5, keen 30. olvh «14.8 2158 


The gains might be less since, for the study, the variables being estimated and the size
measures referred to the same point in time, whereas this would not be true in practice. No
attempt was made to discount the gains, however, since the choice between $D_0$ and $D_1$ was
clear both in terms of variances and on operational grounds (as discussed in Subsection 3.1).
Further efforts were therefore devoted to the choice between $D_1$ and $D_2$.


Variance Analysis: $D_2$ vs. $D_1$


In this study the number of ER’s was expanded to 11, and study variables (employed and
unemployed) were based on the 1976 Census, whereas size measures were based on the 1971
Census. Also variances were computed with ratio estimation based on total population.

The average variance efficiency of $D_2$ with respect to $D_1$ was 1.16 for employed and 0.97
for unemployed (Table 4).


Cost Analysis: $D_2$ vs. $D_1$


Values of all the parameters in the cost model are presented in Table 3 along with $C_{D_1}$
and $C_{D_2}$ and their ratio.

As expected, the between PSU and between secondary components of interviewer fees and
expenses are higher under $D_1$ due to the frequent lack of contiguity between rural and urban
portions of PSU’s. The average reduction factor $r$ in these components under $D_2$ was
estimated as in Table 3, leading to an overall cost efficiency for $D_2$ vs. $D_1$ of 1.08 (Table 4).


Combined Cost-Variance Analysis: $D_2$ vs. $D_1$


Table 4 gives the relative cost-variance efficiencies of $D_2$ vs. $D_1$ under telephone
interviewing. In terms of overall efficiency, $D_2$ is 25% and 5% more efficient than $D_1$ for
employed and unemployed respectively.
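
The combined figures in Table 4 appear to be the product of the variance efficiency and the cost efficiency, which is also how the relative efficiency is defined in the heading of Table 6. The sketch below checks that reading against the rows of Table 4 that are fully legible; the dictionary layout and function name are illustrative only.

```python
def combined_efficiency(variance_eff, cost_eff):
    """Relative cost-variance efficiency as the product of the two ratios."""
    return variance_eff * cost_eff


# (variance eff. employed, variance eff. unemployed, cost eff.) as printed in Table 4.
table4 = {
    "32": (0.91, 0.72, 1.03),
    "41": (1.14, 0.86, 1.23),
    "51": (0.96, 1.01, 1.03),
    "72": (1.00, 0.91, 1.06),
    "All": (1.16, 0.97, 1.08),
}

combined = {
    er: (round(combined_efficiency(v_emp, c), 2),
         round(combined_efficiency(v_unemp, c), 2))
    for er, (v_emp, v_unemp, c) in table4.items()
}

for er, (emp, unemp) in combined.items():
    print(er, emp, unemp)
```

For these rows the products reproduce the last two columns of Table 4 to two decimals, including the overall values of 1.25 and 1.05.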

Based on these findings it was decided to adopt $D_2$ in the 2/3 of ER’s capable of supporting
both urban and rural strata, and design $D_1$ was adopted in the remaining cases.
Table 3


Values of Parameters in the NSR Cost Model and Relative Cost
Efficiencies of $D_2$ vs. $D_1$ with Telephone Interviewing


ER $F_0$ $F_1$ $F_2$ $E_1$ $E_2$ $\alpha$ $r$ $C_{D_1}$ $C_{D_2}$ $C_{D_1}/C_{D_2}$
22 205 meu Ave Sled t ee O1952 029). 0.85 60.93. 2 45.38 5.28 1.02 
32 2.13 0.86.40 1rd bs 10:900mrt0:97 2900.84 OF 0.88.9. 7) 53512 35<87 1.03 
41 2.04 0.94 0.94 0.96 0.69 0.84 0.42 5.01 4.08 1.23 
44 2.04 0.94 0.94 0.96 0.69 0.84 0.50 5.01 4.21 1.19 
51 1.94 0.80 1.07 0.81 0.75 0.84 0.89 4.82 4.67 1.03
56 1.94 0.80 1.07 0.81 0.75 0.84 0.68 4.82 4.39 1.10
63 2074 Wil Oden dl O3uie baliclD uO Ol aat 0.7 Sani O'8 Le nee5.66 he 5.41 1.05 
72 192 6t0;96 Mee 13) per iO5ee 1309 MO. Soci OS 2-e 15 52 IF = 5201 1.06 
82 teSSeamels Loant eOlntinl-20estsO4 tere C-SOsin a ad iad 4.69 1.18 
86 SSM LIDS ML 1L01 IA 11.207 4 40;94.5. 170;8618 8°0;901O"2515504 185.35 1.04 
96 QOS SOS Liv) 1O2SUOORS IM 2O.85teU 0:84 tine O75 RESi07 e474 1.07 
Table 4


Relative Cost-Variance Efficiencies of $D_2$ vs. $D_1$


      Variance Efficiency        Cost Efficiency        Relative Cost-Variance Efficiency
      $V_{D_1}/V_{D_2}$          $C_{D_1}/C_{D_2}$      $C_{D_1}V_{D_1}/(C_{D_2}V_{D_2})$
ER    Employed  Unemployed                              Employed  Unemployed
22 1.09 0.93 1.02 has 0.95 
32 0.91 0.72 1.03 0.94 0.74 
41 1.14 0.86 1.23 1.40 1.06 
44 1.39 1.14 1.19 1.65 1.37
51 0.96 1.01 1.03 0.99 1.04 
56 Pie 1 1.10 i723 1.66 
63 1335 1.06 1.05 1.41 el 
72 1.00 0.91 1.06 1.06 0.96 
82 1.09 1.01 1.18 1.27 1.19
86 1.20 1.05 1.04 1.25 1.09
96 1.38 1.05 1.07 1.48 1.12 
All* 1.16 0.97 1.08 1.25 1.05


* Weighted average by population size. 
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3.5 Special 2-Stage Design for Prince Edward Island 


For Canada’s smallest province, Prince Edward Island, where sampling rates of 4% are
required in order to produce reliable provincial data, design alternative $D_3$, a stratified
sample of EA’s and dwellings, was considered as an alternative to $D_2$.

$D_3$ did not feature any clustering of the sample into geographically contiguous primaries
designed to correspond to interviewers’ assignments, as it was hypothesized that, given the
high sampling rates, the increase in data collection costs might be more than offset by variance
reductions due to elimination of a stage of sampling, and due to stratification gains resulting
from having more strata (i.e., up to 4 times as many as under $D_2$).

Cost-variance study results showed the variance efficiency of $D_3$ vs. $D_2$ to be 2.39 for
employed and 1.20 for unemployed, while costs under $D_3$ were only 8% greater. Hence, based
on overall cost-variance efficiencies of 2.21 for employed and 1.11 for unemployed, $D_3$
was opted for.


3.6 Number of PSU’s Selected Per Stratum 


Under both designs $D_1$ and $D_2$, the sample yield per PSU was fixed at 55-60 dwellings
to correspond to an interviewer’s assignment. In about half of the ER’s, there was only enough 
sample for 2 or 3 PSU’s to be selected. Further stratification in these cases was ruled out 
on the grounds that there should be at least 2 PSU’s per stratum to permit unbiased estima- 
tion of variance. 

For the remaining ER’s, some consideration was given to having 4-5 PSU’s per stratum, 
as this would permit greater flexibility to reduce the size of the area sample, for example, 
if a portion of the area sample at some time in the future were to be converted to a telephone 
sample under a dual frame set-up. However, stratification to the point of 2-3 PSU’s per 
stratum was adopted, based on variance reductions of 14.8% for employed and 5.4% for 
unemployed for these ER’s. A detailed description of the stratification procedures followed 
can be found in Drew, Bélanger, and Foy (1985). 


4. COST-VARIANCE OPTIMIZATION BETWEEN SR and NSR AREAS 


The next step in the cost-variance optimization of the LFS design was the optimization
of the allocation of sample between SR and NSR areas. We used the simple cost and variance
models considered by Fellegi, Gray, and Platek (1967), i.e.,


cost: $$C = \sum_{j=1}^{2} \frac{C_j P_j}{W_j}, \tag{4.1}$$

variance: $$V = \sum_{j=1}^{2} W_j P_j \sigma_j^2, \tag{4.2}$$

where $j$ = area type (= 1 for SR; = 2 for NSR),
$C_j$ = unit (i.e., per person) cost,
$P_j$ = population,
$1/W_j$ = sampling rate,
$\sigma_j^2$ = unit variance.


Fellegi et al. showed that if $C$ is minimized with $V$ fixed, the ratio of the inverse sampling rates is


$$\frac{W_1}{W_2} = \frac{\sigma_2}{\sigma_1} \sqrt{\frac{C_1}{C_2}}. \tag{4.3}$$
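
The allocation result can be illustrated numerically. The sketch below reads (4.1) as a cost of the form sum of C_j P_j / W_j and (4.2) as a variance of the form sum of W_j P_j sigma_j^2, solves for the inverse sampling rates W_j that minimize cost at a fixed variance, and checks that moving away from them along the variance constraint raises the cost. All input values are hypothetical and chosen only for illustration; they are not LFS figures.

```python
import math


def optimal_weights(costs, pops, sigmas, v_target):
    """Inverse sampling rates W_j minimizing cost for a fixed variance.

    The Lagrange condition gives W_j proportional to sqrt(C_j) / sigma_j;
    the constant k is set so that the variance equals v_target.
    """
    denom = sum(p * s * math.sqrt(c) for c, p, s in zip(costs, pops, sigmas))
    k = v_target / denom
    return [k * math.sqrt(c) / s for c, s in zip(costs, sigmas)]


def total_cost(costs, pops, weights):
    return sum(c * p / w for c, p, w in zip(costs, pops, weights))


def total_variance(pops, sigmas, weights):
    return sum(w * p * s ** 2 for p, s, w in zip(pops, sigmas, weights))


# Hypothetical inputs: index 0 = SR areas, index 1 = NSR areas.
costs = (20.0, 30.0)            # unit costs
pops = (600_000.0, 400_000.0)   # populations
sigmas = (0.25, 0.30)           # unit standard deviations
v_target = 1.5e7                # fixed variance of the estimated total

w_opt = optimal_weights(costs, pops, sigmas, v_target)
ratio_from_formula = (sigmas[1] / sigmas[0]) * math.sqrt(costs[0] / costs[1])
print(w_opt[0] / w_opt[1], ratio_from_formula)  # the two ratios agree


def constrained_cost(w1):
    """Cost when W_1 is set by hand and W_2 is solved from the variance constraint."""
    w2 = (v_target - w1 * pops[0] * sigmas[0] ** 2) / (pops[1] * sigmas[1] ** 2)
    return total_cost(costs, pops, (w1, w2))
```

Evaluating `constrained_cost` at values of W_1 on either side of `w_opt[0]` gives a higher cost than at the optimum, as expected for a convex cost and a linear constraint.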
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The other optimization criteria described in Section 1 also give the same ratio as above. 
Parameters were estimated as follows: 


(i) Unit costs: Historical per dwelling costs by type of area were available. These were decreas- 
ed by 10% for NSR areas, to take account of the estimated effect of a shift to telephone 
interviewing of all rotation groups except the rotate-in group for the redesigned sample. 


(ii) Unit variances: Optimization was carried out with respect to the characteristic unemployed,
for which variances were given by:


$$\sigma_j^2 = \delta_j \frac{u_j}{P_j} \left(1 - \frac{u_j}{P_j}\right),$$


where $\delta_j$ = design effect for unemployed, and $u_j$ = unemployed.


Historical design effects by type of area were available, and were reduced to take
account of structural improvements in the respective NSR and SR designs as described in
Sections 2 and 3. Unemployment levels were based on 1980-82 average LFS data, which seemed
appropriate in light of medium term forecasts that were not calling for a return to pre-1982
recession levels of unemployment, and population counts were based on the 1981 Census.

Table 5 presents the percent of sample in SR areas under the following allocations: (i) 
old design, (ii) proportional allocation, (iii) optimum allocation under the assumed cost and 
variance model, and (iv) the allocation adopted for the redesigned sample. The optimum 
allocation could not be adopted because of subprovincial data reliability constraints. In most 
cases, the differences between the optimum allocation and the one adopted are small. The 
optimal allocation turned out to be quite close to proportional, and quite different from 
the allocation under the old design. 


Table 5 
Percent of Sample in SR Areas within Provinces for (1) Old Sample, 
(2) Proportional Allocation, (3) Optimum Allocation, 
and (4) Redesigned Sample 


Province Old Proportional Optimum Redesigned
Sample Allocation Allocation Sample
Newfoundland 41.8 ey 8) 42.6 44.6 
Prince Edward Island 26.6 32.8 32.8 25.9 
Nova Scotia 37.3 57.4 58.8 51.9
New Brunswick 49.5 52.5 47.4 53.6 
Quebec 56.8 74.8 74.6 68.9
Ontario 62.5 79.1 78.8 75.0
Manitoba 54.1 71.0 76.4 56.4 
Saskatchewan 44.7 ay) kate) 62.1 56.8 
Alberta 60.0 68.6 72.6 62.3 
British Columbia 58.0 78.0 74.6 69.7 


Canada aye Oust 67.4 62.3 
Table 6


Relative Efficiency of the Redesigned Sample Allocation
with Respect to the Old by Province (Unemployed)


Province   Cost Ratio ($= C^{(O)}/C^{(N)}$)   Variance Ratio ($= V^{(O)}/V^{(N)}$)   Rel. Eff. ($= C^{(O)}V^{(O)}/(C^{(N)}V^{(N)})$)
Newfoundland 1.00 1.00 1.00 
Prince Edward Island 1.01 1.02 1.03 
Nova Scotia 1.04 1.14 1.18 
New Brunswick 1.01 0.98 0.99 
Quebec 1.03 1.06 1.09 
Ontario 1.04 1.08 1.12
Manitoba 1.01 1.03 1.04 
Saskatchewan 1.05 1.06 1.12
Alberta 1.01 1.01 1.02 
British Columbia 1.02 1.09 es 
Canada 1.03 1.07 1.10 


The projected gains resulting solely from the re-allocation process under the assumption
of fixed (old) provincial sample sizes and uniform sampling rates within the two area types
are presented in Table 6. For this table, the unit costs and variances described above were
used in determining the total costs and variances, $C^{(O)}$, $C^{(N)}$, $V^{(O)}$, $V^{(N)}$, under the old and
new allocations respectively. The new allocation would have resulted in a 3% decrease in
total cost and a 7% decrease in total variance of unemployed, for a combined relative
efficiency (as defined in Table 6) of 1.10. Had it not been for the subprovincial data
requirements, a relative efficiency of 1.12 could have been achieved under the optimal allocation.

The actual efficiency gains for the redesigned sample vs. the old sample are considered 
in the following section. 


5. CONCLUSIONS 


The changes in the LFS design taken as a result of the cost-variance studies are the follow- 
ing: elimination of a stage of sampling in NSR rural areas, adoption of a design featuring 
rural/urban stratification, adoption of a 2-stage NSR design in Prince Edward Island, in- 
crease in the number of NSR strata to the extent that only 2 or 3 PSU’s per stratum will 
be selected, and re-optimization of the allocation of sample between NSR and SR areas. The 
near optimality of other design parameters established earlier by Fellegi, Gray and Platek 
(1967) was found to have remained unchanged, for example the number of dwellings to select 
per PSU in SR Areas. 

The efficiency gains resulting from the changes permitted a 7% reduction in the overall 
LFS sample size and achieved the required reliability of subprovincial data (Singh et al. 1984) 
without impacting on the reliability of provincial and national estimates. The only excep- 
tions were the provinces of Quebec and Manitoba, where greater subprovincial data demands 
Table 7


Relative Efficiency of the Redesigned vs. the Old Sample for Unemployed


Province   Cost Ratio* ($= C^{(O)}/C^{(N)}$)   Variance Ratio ($= V^{(O)}/V^{(N)}$)   Rel. Eff. ($= C^{(O)}V^{(O)}/(C^{(N)}V^{(N)})$)
Newfoundland 1.19 1.00 1.19
Prince Edward Island 1.10 iM i) 1.24 
Nova Scotia | eres 1.04 V2) 
New Brunswick 1.17 0.99 1.16 
Quebec j bal Us) 0.95 1.09 
Ontario | a 1.03 1.16 
Manitoba F177. 0.96 Dale 
Saskatchewan 1.23 1.02 1.25
Alberta** | Lal 1.00 1.15 
British Columbia | be) be) 1.01 1.16 
Canada 1.17 0.99 1.16 


* Based on the redesigned sample with telephone interviewing and the old sample with 
personal visit interviewing in NSR areas. 
** Supplementary sample not included. 


necessitated a slight loss in provincial data reliability. Table 7 gives the cost, variance and com- 
bined cost-variance ratios for the old sample (old design with 55,500 hhlds/month and no 
telephone interviewing in NSR’s) vs. the redesigned sample (new design with 51,600 hhlds/month 
and telephone interviewing). The significant cost reductions are due to the shift to telephone 
interviewing in months 2-6 in NSR areas, and the sample size reduction. The overall cost- 
variance efficiency of the redesigned sample relative to the old sample was 1.16 (Table 7). 


APPENDIX A 


Variance Formula and Computation Method for RPPSS Sampling 


Suppose that a sample of size $n$ is selected by randomized PPS systematic sampling
from $N$ units. Let $p_i$ be the normalized size measure of the $i$-th unit such that $\sum_{i=1}^{N} p_i = 1$.
The Horvitz-Thompson estimator of the total $Y$ for a characteristic $y$ is given by (Horvitz
and Thompson 1952):


$$\hat{Y}_{HT} = \sum_{i \in S} \frac{y_i}{\pi_i},$$


where $S$ = the selected sample of size $n$,
$y_i$ = $y$-value of the $i$-th unit,
$\pi_i = n p_i$, the probability that the $i$-th unit is in $S$,

and its variance is


$$V(\hat{Y}_{HT}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \neq i} (\pi_i \pi_j - \pi_{ij}) \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2,$$

where $\pi_{ij}$ is the joint probability that both the $i$-th and $j$-th units are in $S$. Hartley and Rao
(1962) gave an asymptotic formula for the $\pi_{ij}$’s.


An exact formula by Connor (1966) is also available but quite involved. Hidiroglou
and Gray (1980) developed a computer algorithm using a modification of Connor’s formula
due to Gray (1971), which was used in our study and compared with the Hartley-Rao
approximation. It was found that the Hartley-Rao approximations are very close to the exact
values for $N \geq 16$. We decided to use the Hidiroglou-Gray algorithm for $N < 16$ and the
Hartley-Rao approximation for $N \geq 16$, considering the exponential increase in computation with
the algorithm as $N$ increases.
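
The estimator and its variance can be checked on a small example. The sketch below implements the Horvitz-Thompson estimate and the pairwise variance formula, then compares the formula with a brute-force variance over all possible samples. For simplicity it uses simple random sampling without replacement, where the joint inclusion probabilities are known exactly; this is an assumption made for illustration and does not reproduce the RPPSS joint probabilities, the Hartley-Rao approximation, or the Hidiroglou-Gray algorithm. The population values are hypothetical.

```python
from itertools import combinations


def ht_estimate(sample, y, pi):
    """Horvitz-Thompson estimate of the total from the sampled unit indices."""
    return sum(y[i] / pi[i] for i in sample)


def pairwise_variance(y, pi, pi_joint):
    """Variance of the HT estimator: half the sum over ordered pairs i != j."""
    total = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if i != j:
                diff = y[i] / pi[i] - y[j] / pi[j]
                total += (pi[i] * pi[j] - pi_joint[i][j]) * diff ** 2
    return 0.5 * total


# Hypothetical population; equal-probability sampling of n = 2 from N = 5.
N, n = 5, 2
y = [3.0, 7.0, 8.0, 12.0, 20.0]
pi = [n / N] * N
joint = n * (n - 1) / (N * (N - 1))   # joint inclusion probability under SRS
pi_joint = [[joint if i != j else pi[i] for j in range(N)] for i in range(N)]

# Brute force: every sample of size n is equally likely under SRS.
estimates = [ht_estimate(s, y, pi) for s in combinations(range(N), n)]
mean_estimate = sum(estimates) / len(estimates)
brute_force_variance = sum((e - mean_estimate) ** 2 for e in estimates) / len(estimates)

formula_variance = pairwise_variance(y, pi, pi_joint)
print(mean_estimate, brute_force_variance, formula_variance)
```

The mean of the estimates equals the population total, and the two variances coincide, which is the behaviour the formula should show for any fixed-size design with known joint probabilities.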


APPENDIX B 


Cost Simulation of $D_2$ vs. $D_1$


In order to estimate $r$, the ratio of fees and expenses for travel from home to area, between
PSU’s, and between secondaries under NSR design alternatives $D_2$ and $D_1$, a Monte
Carlo study was carried out. The sample frames under $D_1$ and $D_2$ were simulated to the level
of secondaries using Census data for each of the 11 study ER’s. Fifty samples were drawn
following each design, and the selected secondaries for each sample were grouped into
geographically optimal assignments. If $M^{(1)}$ and $M^{(2)}$ are the average measures of within
assignment geographic dispersion under designs $D_1$ and $D_2$, then $r$ was estimated by


$$r = M^{(2)}/M^{(1)}.$$


The M-measure for a given sample was defined in the following manner. Suppose that
$k$ interviewers cover an ER and $G_i = \{U_{ij};\ j = 1, 2, ..., n_i\}$ is the $i$-th interviewer’s
assignment, with $n_i$ second stage sampling units. Let $(x_{ij}, y_{ij})$ be the population centroid of $U_{ij}$
defined in Euclidean coordinates. The M-measure for the ER is defined as


$$M = \sum_{i=1}^{k} M_i, \qquad M_i = \sum_{j=1}^{n_i} \left\{(x_{ij} - \bar{x}_i)^2 + (y_{ij} - \bar{y}_i)^2\right\}^{1/2},$$


where $(\bar{x}_i, \bar{y}_i)$ is the center of $G_i$, i.e., $\bar{x}_i = (1/n_i) \sum_{j=1}^{n_i} x_{ij}$ and $\bar{y}_i = (1/n_i) \sum_{j=1}^{n_i} y_{ij}$.
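
A small numerical illustration of the dispersion measure may help. The sketch below assumes that the dispersion of an assignment is the sum of Euclidean distances from each unit centroid to the assignment centre; that reading, the function name, and the coordinates are assumptions made for illustration.

```python
import math


def m_measure(assignments):
    """Total within-assignment dispersion over all interviewer assignments.

    `assignments` is a list of assignments; each assignment is a list of
    (x, y) centroids of its second stage units.
    """
    total = 0.0
    for units in assignments:
        cx = sum(x for x, _ in units) / len(units)   # assignment centre, x
        cy = sum(y for _, y in units) / len(units)   # assignment centre, y
        total += sum(math.hypot(x - cx, y - cy) for x, y in units)
    return total


# Two ways of splitting the same four hypothetical units between two interviewers.
compact = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
scattered = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(m_measure(compact), m_measure(scattered))
# prints: 4.0 20.0
```

The ratio of two such measures, averaged over simulated samples, is the kind of quantity used to estimate r.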


The determination of optimum interviewer assignments, that is the minimization of the 
M-measure, reduces to a classification or clustering problem. The following clustering 
algorithms were investigated: 


i) Friedman-Rubin (1967) Transfer Algorithm 


This non-hierarchical algorithm which was adopted for stratification of the LFS sample 
(Drew et al. 1985), starts with a random partitioning of units and proceeds towards a 
local optimum by moving one unit at a time from one cluster to another if the move 
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reduces M. It also checks that size constraints are not violated before moving a unit. An 
approximation to the global optimum is achieved by taking several initial random starts. 
A disadvantage of the Friedman-Rubin algorithm in this case was that the strict size con- 
straints required in order to have approximately equi-sized assignments, restricted the move- 
ment of units between clusters. 


ii) Dahmstrom-Hagnell (1975) Exchange Algorithm 


This algorithm is similar to the Friedman-Rubin algorithm, except that it is based on
exchanging pairs of units between clusters as opposed to transferring individual units. Hence
it works better under strict size constraints.


iii) Combined Algorithms


Define a cycle of a combined algorithm as application of the exchange algorithm, follow- 
ed by the transfer algorithm. Then we considered both single and two cycle combined 
algorithms. 
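
The exchange step described above can be sketched as follows. This is a minimal illustration, not the algorithm as implemented for the study: it assumes equal-sized assignments built from random starts, takes the dispersion of an assignment to be the sum of distances to its centre, and accepts any pairwise swap that lowers the total. All names and coordinates are hypothetical.

```python
import math
import random


def dispersion(units):
    """Sum of distances from each unit to the centre of its assignment."""
    cx = sum(x for x, _ in units) / len(units)
    cy = sum(y for _, y in units) / len(units)
    return sum(math.hypot(x - cx, y - cy) for x, y in units)


def m_measure(assignments):
    return sum(dispersion(units) for units in assignments)


def exchange_pass(assignments):
    """Try every pairwise swap between assignments; keep swaps that reduce M."""
    improved = False
    for a in range(len(assignments)):
        for b in range(a + 1, len(assignments)):
            ga, gb = assignments[a], assignments[b]
            for i in range(len(ga)):
                for j in range(len(gb)):
                    before = dispersion(ga) + dispersion(gb)
                    ga[i], gb[j] = gb[j], ga[i]
                    if dispersion(ga) + dispersion(gb) < before - 1e-12:
                        improved = True
                    else:
                        ga[i], gb[j] = gb[j], ga[i]   # undo a non-improving swap
    return improved


def exchange_algorithm(units, k, starts=5, seed=0):
    """Best local optimum over several random starts; sizes never change."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):
        shuffled = list(units)
        rng.shuffle(shuffled)
        assignments = [shuffled[i::k] for i in range(k)]
        while exchange_pass(assignments):
            pass
        if best is None or m_measure(assignments) < m_measure(best):
            best = [list(g) for g in assignments]
    return best


# Two well separated groups of four hypothetical units, two interviewers.
units = [(0, 0), (0, 1), (1, 0), (1, 1),
         (100, 100), (100, 101), (101, 100), (101, 101)]
best = exchange_algorithm(units, k=2)
print(round(m_measure(best), 2))
```

Because units are only ever swapped, assignment sizes are preserved automatically, which is why an exchange step copes with strict size constraints better than a transfer step.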


The combined two cycle algorithm worked best, requiring the smallest number of ran- 
dom starts and the least computing cost to achieve the same level of optimality as the 
other algorithms. Performance of the 1 and 2 cycle combined algorithms based on 21 
replicates is summarized below. 


One Cycle Two Cycle 
No. of Random Starts No. of Random Starts
1 2 4 Oe 1 2 4 
M-measure* 336.18 329.19 325.65 By AS| 327.55 325.69 325.51 
Standard Deviation 15.84 15.45 15.67 15.69 16.10 15.67 15.69
Computing Cost ($) 5.94 11.24 21.67 53.90 8.17 15a 29.38 
* Average over 21 replicates. 
Performance of ARIMA Models in Time Series¹
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ABSTRACT 


This study is mainly concerned with an evaluation of the forecasting performance of a set of the most 
often applied ARIMA models. These models were fitted to a sample of two hundred seasonal time 
series chosen from eleven sectors of the Canadian economy. The performance of the models was judg- 
ed according to eight variable criteria, namely: average forecast error for the last three years, the chi- 
square statistic for the randomness of the residuals, the presence of small parameters, overdifferenc- 
ing, underdifferencing, correlation between the parameters, stationarity and invertibility. Overall and 
conditional rankings of the models are obtained and graphs are presented. 


KEY WORDS: X11-ARIMA; Ranking; Priority; Criteria 


1. INTRODUCTION 


Our socio-economic environment is unstable and uncertain; inflation, recessions, and
increasing pollution are among the factors contributing to increasing instability. We try to resolve
the problem by using a method of forecasting that permits us to evaluate the impact of the
frequent changes. ARIMA models (Box and Jenkins 1970) are flexible enough to deal with
such frequent changes in time series.

The purpose of this paper is to study a set of eight criteria which when applied to the 
Box-Jenkins method permit an evaluation of the fitting and forecasting performance of a 
set of the most often applied ARIMA models to Canadian economic time series. The ques- 
tion of which models perform well is important for programs like the X-11-ARIMA (Dagum 
1980) which automatically fits a fixed small set of models (three models in the case of the 
X-11-ARIMA) to the series. 

Section 2 introduces eight criteria: the average forecast error for the last three years, the 
chi-square statistic for the randomness of the residuals, the presence of small parameters, 
overdifferencing, underdifferencing, correlation between the parameters, stationarity and 
invertibility. Section 3 discusses the criteria and summarizes the results. Section 4 ranks the 
models conditionally and unconditionally. Section 5 compares within-sample and out-of- 
sample extrapolated values for the last three years. 


2. THE CRITERIA 


In this section we give a brief discussion of the eight criteria used in ranking the models. 


¹ Presented at (1) the Business and Economic Forecasting Session of the Canadian Operational Research Symposium,
Ottawa, May 1984 and (2) the Business and Economic Statistics Section of the American Statistical Association Meetings,
Philadelphia, August 1984.

² K. Chiu, J. Higginson, and G. Huot, Time Series Research and Analysis Division, Statistics Canada, Tunney’s
Pasture, Ottawa, Ontario, Canada K1A 0T6.
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Stability 
The stability condition of a process $Z_t$ is either "stationary" or "non-stationary". It
indicates how well the system remembers the shocks $a_{t-j}$, $j = 1, 2, ...$, and how fast or
slowly the response of the system to any particular shock decays. For a process

$$Z_t = a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots = \psi(B) a_t,$$

where $a_t \sim NID(0, \sigma_a^2)$, the filter is said to be stable if the sequence $\{\psi_j\}$ is convergent. For
a general ARIMA model $(p, d, q)$,

$$\phi(B) (1 - B)^d Z_t = \theta(B) a_t,$$

the stability condition is that all the $\lambda_i$ of the characteristic equation

$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p = (1 - \lambda_1 B)(1 - \lambda_2 B) \cdots (1 - \lambda_p B) = 0$$

for the process are strictly inside the unit circle, i.e. $|\lambda_i| < 1$.


Invertibility 


The process $Z_t$ may be expressed as:

$$Z_t = a_t + \pi_1 Z_{t-1} + \pi_2 Z_{t-2} + \cdots.$$


The system is said to be invertible if the sequence $\{\pi_j\}$ is convergent. The criterion is
considered to be of primary importance because if the invertibility condition fails, the generating
function $\pi(B)$ of the $\pi$’s increases without bound. This means the current event of the system
depends more on events in the distant past than in the recent past, and the process is physically
meaningless.

The invertibility condition for a general ARIMA model $(p, d, q)$ is that the $\nu_i$ of the
characteristic equation

$$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q = (1 - \nu_1 B)(1 - \nu_2 B) \cdots (1 - \nu_q B) = 0$$

for the process are strictly within the unit circle, i.e. $|\nu_i| < 1$.
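
Both conditions amount to checking that the factors of a polynomial in B lie inside the unit circle. The sketch below does this for polynomials of order up to two, where the factors follow from the quadratic formula; the function names and the coefficient values are illustrative, and higher orders would need a general polynomial root finder.

```python
import cmath


def factor_roots(c1, c2=0.0):
    """Return (l1, l2) with 1 - c1*B - c2*B**2 = (1 - l1*B)(1 - l2*B).

    Matching coefficients gives l1 + l2 = c1 and l1*l2 = -c2, so l1 and l2
    are the roots of l**2 - c1*l - c2 = 0.
    """
    disc = cmath.sqrt(c1 * c1 + 4.0 * c2)
    return (c1 + disc) / 2.0, (c1 - disc) / 2.0


def inside_unit_circle(c1, c2=0.0):
    """True when both factors have modulus strictly below 1.

    Applied to autoregressive coefficients this is the stationarity check;
    applied to moving average coefficients it is the invertibility check.
    """
    return all(abs(root) < 1.0 for root in factor_roots(c1, c2))


print(inside_unit_circle(0.5, 0.3))    # AR(2) with phi = (0.5, 0.3): True
print(inside_unit_circle(0.6, 0.5))    # AR(2) with phi = (0.6, 0.5): False
print(inside_unit_circle(1.0, -0.5))   # complex factors of modulus about 0.71: True
print(inside_unit_circle(0.4))         # MA(1) with theta = 0.4: True
```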


Underdifferencing 


In the AR($p$) model, when one or more of the $\lambda_i$, say $\lambda_p$, approaches 1, then from


$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p = (1 - \lambda_1 B)(1 - \lambda_2 B) \cdots (1 - \lambda_p B) = (1 - \phi'_1 B - \phi'_2 B^2 - \cdots - \phi'_{p-1} B^{p-1})(1 - \lambda_p B),$$

where $\phi'_1, ..., \phi'_{p-1}$ are the coefficients of the product of the remaining $p - 1$ factors, we have $\phi(B)$ approaching


$$(1 - \phi'_1 B - \phi'_2 B^2 - \cdots - \phi'_{p-1} B^{p-1})(1 - B).$$

Therefore, a differencing operator may be needed for this system, and the AR($p$) model
becomes an ARI($p - 1$, 1) model. Furthermore, when $\lambda_p$ approaches 1, we may have
non-stationarity.


Overdifferencing 


Consider the general ARIMA model $(p, d, q)(P, D, Q)_s$,

$$\phi(B) \Phi(B^s) (1 - B)^d (1 - B^s)^D Z_t = \theta(B) \Theta(B^s) a_t.$$


If any $\nu_i$ of the characteristic equation $\theta(B) = 0$ approaches 1, i.e. if any $(1 - \nu_i B)$ approaches
$(1 - B)$, we can eliminate $(1 - B)$ from both sides.


Test of randomness for the a,’s 


Correlation in the residuals is not desirable since we want an unbiased estimate of the 
parameters for the process. 
The statistic


$$Q = n(n + 2) \sum_{k=1}^{m} (n - k)^{-1} \hat{\rho}_k^2,$$


as modified by Prothero and Wallis (1976) and Ljung and Box (1978) from the chi-square
test of Box and Pierce, is used.

Here $n$ is the sample size, $k = 1, 2, ..., m$ are the various lags, and $\hat{\rho}_k$ are the
autocorrelations of the residuals. $Q$ is used for testing the randomness of the residuals.
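
The statistic is simple to compute directly. The sketch below assumes the autocorrelations are estimated from the mean-corrected residuals in the usual way and uses a deliberately non-random alternating series as input; both choices are made for illustration only.

```python
def autocorrelations(x, m):
    """Sample autocorrelations at lags 1..m of the mean-corrected series."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
        for k in range(1, m + 1)
    ]


def q_statistic(residuals, m):
    """Q = n(n + 2) * sum over k = 1..m of r_k**2 / (n - k)."""
    n = len(residuals)
    r = autocorrelations(residuals, m)
    return n * (n + 2) * sum(rk ** 2 / (n - k) for k, rk in enumerate(r, start=1))


# A strictly alternating series is far from random: lag-1 autocorrelation is -0.95.
alternating = [1.0, -1.0] * 10
q1 = q_statistic(alternating, m=1)
print(round(q1, 2))
# prints: 20.9
```

A value this large relative to a chi-square distribution with few degrees of freedom would lead to rejecting randomness of the residuals.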


Small Parameters 


Generally speaking, when the number of parameters of a given model is increased, the
mean sum of squares $\sigma_a^2$ is reduced. However, only large parameters, or those parameters
significantly different from 0, can contribute to a significant reduction of $\sigma_a^2$. To check for
a small parameter, we may need an F-test (Pandit and Wu 1983):


$$F = \frac{A_1 - A_0}{s} \div \frac{A_0}{N - r} \sim F(s, N - r),$$


where $r$ is the number of parameters of the unrestricted model and $s$ is the number of parameters which
are restricted to zero. $N$ is the number of observations, $A_0$ is the smaller sum of squares
of the unrestricted model, and $A_1$ is the larger sum of squares of the restricted model.

But in our study here, we choose two constants, 0.05 and 0.10, as our indicator of the 
presence of a small parameter. 
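
For reference, the F-test described above reduces to a one-line computation once the two sums of squares are available. The sketch below takes A1 as the sum of squares of the restricted model and A0 as that of the unrestricted model; the numerical inputs are hypothetical.

```python
def f_statistic(a1, a0, s, n_obs, r):
    """F = ((A1 - A0) / s) / (A0 / (N - r)), referred to F(s, N - r).

    a1    : sum of squares of the restricted model (larger)
    a0    : sum of squares of the unrestricted model (smaller)
    s     : number of parameters restricted to zero
    n_obs : number of observations N
    r     : number of parameters of the unrestricted model
    """
    return ((a1 - a0) / s) / (a0 / (n_obs - r))


# Hypothetical values: dropping 2 of 4 parameters raises the sum of squares
# from 100 to 120 with 104 observations.
f_value = f_statistic(a1=120.0, a0=100.0, s=2, n_obs=104, r=4)
print(f_value)
# prints: 10.0
```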


Correlation of the Parameters 


High positive or negative correlation between parameters reflects ambiguity in the estimated 
values since a range of parameter values results in models with equally good fit. Therefore, 
if some of the elements in the correlation matrix of estimated parameters are large in ab- 
solute value, say greater than or equal to 0.9, the model may be reduced by deleting some 
of the smaller parameters. 
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Forecasting Error 


No matter how we define a good model or a bad model, we still have a primary interest
in the forecasting error of the model. In this paper we use the mean absolute percentage
forecasting error of the one-year-ahead forecasts,


$$\frac{1}{f} \sum_{\ell=1}^{f} \left| \frac{Z_{t+\ell} - \hat{Z}_t(\ell)}{Z_{t+\ell}} \right| \times 100\%,$$


where $f$ is 12 or 4, and $\hat{Z}_t(\ell)$ is the forecast with lead time $\ell$.
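
A minimal sketch of the forecast error measure follows, assuming it is the average over the forecast year of the absolute error expressed as a percentage of the observed value. The function name and the two-point example are illustrative; in practice the lists would hold 12 monthly or 4 quarterly values.

```python
def mean_absolute_percentage_error(actual, forecast):
    """Average of |actual - forecast| / |actual| over the horizon, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return 100.0 * sum(
        abs(a - f) / abs(a) for a, f in zip(actual, forecast)
    ) / len(actual)


# Hypothetical observed values and forecasts: errors of 10% and 5%.
mape = mean_absolute_percentage_error([100.0, 200.0], [110.0, 190.0])
print(round(mape, 2))
# prints: 7.5
```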


3. EVALUATION OF THE ARIMA MODELS 


The eight criteria have been put into two groups. The first group considers good fitting 
of parsimonious models while the second considers the quality of the forecasts. This distinc- 
tion between fitting and forecasting is important; good fitting and good forecasting are not 
equivalent. 

These criteria have been used to evaluate and rank seven of the most often applied ARIMA 
models, namely: 


1. $(0,1,1)(0,1,1)_s$     5. $(1,1,0)(0,1,1)_s$
2. $(0,1,2)(0,1,1)_s$     6. $(2,1,0)(0,1,1)_s$
3. $(0,2,2)(0,1,1)_s$     7. $(2,1,0)(0,1,2)_s$
4. $(2,1,2)(0,1,1)_s$


where ‘‘s’’ is 12 if the series is monthly and 4 if it is quarterly. 

These models were fitted to a sample of 167 monthly seasonal time series chosen random- 
ly from eleven sectors of the Canadian economy: national accounts; labour; prices; manufac- 
turing; fuel, power and mining; construction; food and agriculture; domestic trade; external 
trade; transportation; and finance. About 40 quarterly time series from national accounts 
and finance were also tested. 

The series are mostly multiplicative, according to the Bell Canada model test (Higginson 
1976). That is, the different components (trend-cycle, seasonal, and irregular) are multiplied 
together to produce the raw series. Therefore, the amplitudes of the seasonal component 
frequently increase with increasing levels of the trend. The multiplicative series received a 
logarithmic transformation before the first three and last three models were fitted. The fourth 
model was fitted to the untransformed series in all cases. 

Looking at the non-seasonal part of an ARIMA model which is associated with the trend- 
cycle and extremes, we see that the models can be grouped into three classes. Class I is models 
1, 2 and 3 whose ordinary part includes only one or two first differences and one or two 
moving average parameters. Class III includes models 5, 6 and 7 whose ordinary part in- 
cludes only one first difference and some autoregressive parameters. Model 4 (Class II) forms 
a class by itself; its non-seasonal part is mixed. We see that the seasonal part of all models 
is the same except for model 7. 

Although the eight criteria are analysed separately in this section, several of them are depen- 
dent. For example, we shall see that the excess of parameters in model 4 generates problems 
of nonstationarity, noninvertibility, under- and overdifferencing, and correlation. 
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In Sections 3 and 4, we test within-sample extrapolated values for the seven ARIMA models. 
That is, the models are fitted to the whole series thus providing the parameters to be used 
for calculating the forecasts for the last three years. This is the way ARIMA forecasts are 
evaluated in the X-11-ARIMA program. 


3.1 Criteria for Fitting Parsimonious ARIMA Models 


The stationarity condition requires that all the roots of the autoregressive characteristic 
equation be inside the unit circle. We see in Table 1 that non-stationarity occurs only for 
model 4, in three cases. These appear to be due to overparametrization of the model. 

In order for the model to be invertible, it is necessary that the roots of the moving average 
characteristic equation be inside the unit circle. Only model 4 has many cases of noninver- 
tibility, 20%, as we see in Table 2. Two explanations are possible. There is first of all the 
case of straightforward noninvertibility. In some other cases noninvertibility was accompanied 
by nonstationarity. The fact that the autoregressive part may have roots near unity might 
have caused autocorrelation in the residuals. The moving average parameters would then 
take higher values to compensate. 
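
The stationarity and invertibility conditions can be checked numerically. The sketch below is
only an illustration under stated assumptions: it handles polynomials with one or two
parameters, it uses the Box-Jenkins sign convention (the characteristic equation is written as
z^p - c1 z^(p-1) - ... - cp = 0, so that acceptable roots lie inside the unit circle, as in the
text), and the function names are hypothetical.

```python
import cmath


def characteristic_roots(coeffs):
    """Roots of z^p - c1*z^(p-1) - ... - cp = 0 for p = 1 or 2."""
    if len(coeffs) == 1:
        return [complex(coeffs[0])]
    if len(coeffs) == 2:
        c1, c2 = coeffs
        disc = cmath.sqrt(c1 * c1 + 4.0 * c2)
        return [(c1 + disc) / 2.0, (c1 - disc) / 2.0]
    raise ValueError("this sketch handles one or two parameters only")


def inside_unit_circle(coeffs):
    """True when every characteristic root has modulus below one."""
    return all(abs(root) < 1.0 for root in characteristic_roots(coeffs))


# Autoregressive parameters: the check corresponds to stationarity.
stationary = inside_unit_circle([0.5, 0.3])        # roots near 0.85 and -0.35
nonstationary = not inside_unit_circle([0.6, 0.5])  # one root above 1

# Moving average parameters: the same check corresponds to invertibility.
noninvertible = not inside_unit_circle([1.2])
```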

An important criterion in judging the appropriateness of the ARIMA models for the series 
is the chi-square test of Box and Pierce (1970) (modified by Prothero and Wallis in 1976, 
and by Ljung and Box in 1978), applied to the autocorrelation of the residuals. Table 3 shows 
for each of the seven models the number and the percentage of series that fail the chi-square 
test at different levels. We see from this table first, that within a given class of models the 
simpler models have higher failure rates and second, that the failure rate depends to a large 
degree on the class of the model. The first point is illustrated by models 2 and 6 which, having 
one more parameter than models 1 and 5, have a higher number of series passing this test. 
The evidence for the second point is that moving average models appear to satisfy the 
chi-square test better than autoregressive models. This may be due to the presence of extremes 
in the series. At the 5% level for example, model 1 fails for 27% of the series compared with 
49% for its autoregressive counterpart model 5. Like all models of class III, the mixed model 
(class II) is inferior to the second model of class I. 


Table 1 
Failure in Stationarity 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
-           -              -              -              3   2%         -              -              -


Table 2 
Failure in Invertibility 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)


Table 3 
Failure in Chi-Square 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
1%         31  19%        18  11%        29  17%        26  16%        62  37%        21  13%        20  12%
5%         45  27%        36  22%        46  28%        41  25%        82  49%        49  29%        42  25%
10%        61  37%        48  29%        56  34%        55  33%        89  53%        60  36%        56  34%
15%        72  43%        57  34%        69  41%        66  40%       101  60%        71  43%        64  38%
20%        83  50%        62  37%        80  48%        76  46%       106  64%        80  48%        73  44%
30%       100  60%        77  46%        94  56%        88  53%       119  71%        95  57%        89  53%
40%       111  66%        97  58%       107  64%        99  59%       127  76%       104  62%       100  60%
50%       121  72%       106  63%       118  71%       113  68%       135  81%       117  70%       116  69%
60%       131  78%       121  72%       128  77%       129  77%       141  84%       127  76%       [illegible] 73%

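
The portmanteau statistics behind this chi-square test can be sketched as follows. This is an
illustration only: it computes the Box-Pierce statistic and the Ljung-Box modification from a
list of residuals, it mean-corrects the residuals (an assumption), and it does not implement the
Prothero-Wallis variant or the comparison with chi-square critical values. The function names
and the residuals are hypothetical.

```python
def autocorrelations(resid, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of mean-corrected residuals."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [e - mean for e in resid]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]


def portmanteau(resid, max_lag):
    """Return (Box-Pierce Q, Ljung-Box Q) for the first max_lag autocorrelations."""
    n = len(resid)
    r = autocorrelations(resid, max_lag)
    box_pierce = n * sum(rk * rk for rk in r)
    ljung_box = n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))
    return box_pierce, ljung_box


# Hypothetical residuals with an obvious alternating pattern (strong autocorrelation).
residuals = [(-1.0) ** t for t in range(24)]
q_bp, q_lb = portmanteau(residuals, 12)
```

Each statistic would then be compared with a chi-square distribution whose degrees of freedom
depend on the number of lags and of estimated parameters; a large value indicates that the
residuals are still autocorrelated and that the model fails the test.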
Underdifferencing occurs when a root of the characteristic equation of the autoregression 
polynomial is close to unity, say a distance ε from unity. Here ε is set equal to 0.1. 
We see in Table 4 that only model 4 is underdifferenced. This may be attributed to over- 
parametrization. Model 4 has two autoregressive parameters and two moving average 
parameters in its non-seasonal part. Just through the estimation, there is a moderate chance 
that at least one of the autoregressive parameters will be greater than or equal to 0.9. 
In this discussion the critical levels chosen for overdifferencing are 0.90 and 0.95. Table 
5 shows that models 3 and 4 are most often overdifferenced. Model 3 has two first differences 
and two non-seasonal moving average parameters. If the second first difference is not 
necessary, autocorrelation is created in the series that has been differenced once already. 
The moving average polynomial will model this introduced autocorrelation by having one 
of its roots close to unity. We can therefore simplify the model by eliminating one moving 
average parameter and one difference. As to model 4, this may be due to overparametrization. 
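
The mechanism described for overdifferencing can be reproduced on simulated data. The sketch
below is an illustration under assumptions of our own: it takes a simulated white noise series
(which needs no differencing), applies one unnecessary first difference, and compares the lag-1
autocorrelations. The seed, the series length and the function name are arbitrary choices.

```python
import random

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]               # already stationary
over = [noise[t] - noise[t - 1] for t in range(1, len(noise))]   # one unnecessary difference


def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a mean-corrected series."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / sum(d * d for d in dev)


r1_noise = lag1_autocorr(noise)   # close to 0
r1_over = lag1_autocorr(over)     # close to -0.5
```

A lag-1 autocorrelation of -0.5 is that of a first-order moving average process whose parameter
equals one, that is, a moving average root on the unit circle. A fitted moving average
polynomial with a root near unity is therefore the signature of an unnecessary difference.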


Table 4 
Failure in Underdifferencing 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
.90         -              -              -             14   8%         -              -              -


In ARIMA modelling of a stochastic process, it is enough to consider the first two moments, 
that is, the mean and autocovariance. The test on the size of the parameters serves only to 
eliminate those that contribute very little or nothing to the explanation of the autocovariance. 

Table 6 illustrates two things. First, the simplest models pass this test better than more 
complicated models. After a logarithmic transformation, most of the multiplicative series 
in the sample will follow a straight line fairly closely (except for seasonal variation), so a 
"first difference" model will fit them using few parameters. Adding an extra unnecessary 
parameter to the model will often result in its receiving a small estimate from the estimation. 
Second, the estimated values of the moving average parameters are small (less than .05 or 
.10) more often than the estimated values of the autoregressive parameters. For example 
at the level of 0.05, the second autoregressive parameter in model 6 is judged unnecessary 
13% of the time compared with 29% of the time for the second moving average parameter 
in model 2. Similarly, the addition of a second seasonal moving average parameter increased 
the failure rate from 13% in model 6 to 43% in model 7. 


Table 5 
Failure in Overdifferencing 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
.90         8   5%        11   7%        43  26%        50  30%         7   4%         9   5%        14   8%
.95         3   2%         6   4%        19  11%        37  22%         3   2%         3   2%         6   4%


Table 6 
Failure in Small Parameter 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
.05        15   9%        49  29%        21  13%        42  25%        12   7%        22  13%        72  43%
.10        26  16%        88  53%        43  26%        73  44%        31  19%        45  28%       114  68%


Table 7 
Failure in Correlation 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
.90         -            [illegible]     86  51%       124  74%         -              -              -


High positive or negative correlations between parameter estimates are undesirable and 
reflect ambiguity in the estimation situation since a range of parameter combinations results 
in models with equally good fits. Table 7 shows that only models 2, 3 and 4 fail the correlation 
test, i.e. the absolute value of at least one of the correlations is ≥ 0.90. The problem 
is minimal for model 2, and serious for models 3 and 4 where 51% and 74% of the fits had 
highly correlated parameters. This may be due to overdifferencing in model 3 and the presence 
of too many parameters in model 4. 
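
The correlation test can be sketched in a few lines. The example below is an illustration only:
the covariance matrix of the parameter estimates is invented, the function name is hypothetical,
and the threshold of 0.90 is the one quoted in the text.

```python
import math


def max_abs_correlation(cov):
    """Largest absolute correlation between distinct parameter estimates."""
    p = len(cov)
    worst = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            r = cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
            worst = max(worst, abs(r))
    return worst


# Hypothetical covariance matrix of two parameter estimates (correlation 0.92).
cov = [[0.0100, 0.0092],
       [0.0092, 0.0100]]
worst = max_abs_correlation(cov)
fails_correlation_test = worst >= 0.90
```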


3.2 Criterion for Extrapolation of ARIMA Models 


This criterion attempts to ensure the quality of the forecasts of the ARIMA models. We 
require that the average percentage forecast error of the fitted model be below a certain level. 

Table 8 shows that six of the seven models are equivalent from the point of view of 
forecasts, i.e. the number of autoregressive and moving average parameters does not affect 
the forecast error of the model averaged over all the series. Of course, some models perform 
better for certain series. 
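
A minimal sketch of the forecast error criterion follows. It assumes that the average
percentage forecast error is the mean of the absolute percentage errors over the forecast
period; the exact formula used in the X-11-ARIMA program is not given in this section. The
figures and the function name are hypothetical.

```python
def average_abs_pct_error(actual, forecast):
    """Mean absolute percentage error, in percent (assumed definition)."""
    errors = [abs(f - a) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical observed values and forecasts for three periods.
actual = [100.0, 110.0, 120.0]
forecast = [90.0, 121.0, 120.0]

error = average_abs_pct_error(actual, forecast)   # errors of 10%, 10% and 0%
passes_at_15 = error < 15.0
```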

Table 9 shows the average forecast error and standard deviation of the error under two 
possible outcomes: passing and failing the forecast error criterion. Not only is the failure 
rate of model 3 higher than that of the other models, but the table shows that when it fails, 
its average forecast error is bigger. The forecast errors of model 3 are increased by its 
overdifferencing. However, when the forecast errors of model 3 pass the criterion, their average 
is as small as that of the other models. 


Table 8 
Failure in Forecast Error 

          CLASS I                                      CLASS II       CLASS III
CRITICAL  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
VALUE     (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
10%        89  53%        84  50%       101  60%        80  48%        84  50%        85  51%        85  51%
15%        57  34%        58  35%        69  41%        53  32%        57  34%        56  34%        55  33%
20%        39  23%        40  24%        51  31%        40  24%        40  24%        40  24%        40  24%
25%        32  19%        33  20%        43  26%        32  19%        36  22%        34  20%        34  20%
30%        24  14%        26  16%        35  21%        24  14%        27  16%        27  16%        27  16%


Table 9 
Conditional Mean (M) and Standard Deviation (SD) 
of the Average Forecast Error 

                  CLASS I                                            CLASS II         CLASS III
Critical  Out-    Model 1          Model 2          Model 3          Model 4          Model 5          Model 6          Model 7
Value     come    (0,1,1)(0,1,1)   (0,1,2)(0,1,1)   (0,2,2)(0,1,1)   (2,1,2)(0,1,1)   (1,1,0)(0,1,1)   (2,1,0)(0,1,1)   (2,1,0)(0,1,2)
                    M  SD            M  SD            M  SD            M  SD            M  SD            M  SD            M  SD
15%       Pass     7%  4.0          6%  [illegible]  7%  4.1          6%  3.8          7%  3.9          7%  4.0          7%  3.9
          Fail    [illegible]      36%  22.5        41%  26.4        36%  21.4        38%  24.5        37%  23.4        37%  23.0


4. RANKING OF THE MODELS 


To rank the models, the eight criteria are used at different acceptance levels. Tables 10 
and 11 present the overall and conditional rankings of the models. Table 10 gives the total 
success rate of the models. Table 11 gives first the total success rate of the best model; the 
following models are chosen according to their success with series with which all higher models 
have failed. 


Table 10 
Overall Ranking of the Models 

2 criteria            8 criteria*           8 criteria*           8 criteria*
FE < 15%              FE < 15%              FE < 15%              FE < 15%
χ² ≥ 5%               χ² ≥ 5%               χ² ≥ 5%               χ² ≥ 5%
                      SP < .10              SP < .05              SP < .05
                      OD = .90              OD = .90              OD = .95

        % of series           % of series           % of series           % of series
Models  that passed   Models  that passed   Models  that passed   Models  that passed

  4        52%          1        34%          6        38%          6        39%
  7        51%          6        31%          1        37%          1        38%
  6        49%          5        23%          2        29%          2        29%
  2        48%          2        20%          5        26%          5        28%
  1        44%          3        13%          7        25%          7        27%
  3        41%          7        11%          3        17%          3        19%
  5        32%          4         2%          4         4%          4         5%

*As well as the four criteria listed, the four other criteria mentioned in the text were imposed. 


Table 11 
Conditional Ranking of the Models 

2 criteria            8 criteria*           8 criteria*           8 criteria*
FE < 15%              FE < 15%              FE < 15%              FE < 15%
χ² ≥ 5%               χ² ≥ 5%               χ² ≥ 5%               χ² ≥ 5%
                      SP < .10              SP < .05              SP < .05
                      OD = .90              OD = .90              OD = .95

        % of series           % of series           % of series           % of series
Models  that passed   Models  that passed   Models  that passed   Models  that passed

  4        52%          1        34%          6        38%          6        39%
  7         9%          3         6%          3         9%          3         9%
  2         1%          6         4%          7         4%          1         4%
  3         1%          5         2%     [illegible]   3%     [illegible]   2%

*As well as the four criteria listed, the four other criteria mentioned in the text were imposed. 
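
The difference between the overall (unconditional) and the conditional ranking can be sketched
as follows. This is an illustration of the procedure as described above, not the authors'
program: the function names, the tie-breaking rule (lowest model number first) and the toy data
are assumptions.

```python
def unconditional_ranking(passed):
    """Rank models by the number of series on which they satisfy all criteria.

    passed: dict mapping a model number to the set of series it passes.
    """
    return sorted(((m, len(s)) for m, s in passed.items()),
                  key=lambda item: (-item[1], item[0]))


def conditional_ranking(passed):
    """Rank models by their success on series that all higher-ranked models failed."""
    remaining = dict(passed)
    covered = set()
    ranking = []
    while remaining:
        # Choose the model that passes the most series not yet covered.
        best = max(sorted(remaining), key=lambda m: len(remaining[m] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break
        ranking.append((best, len(gain)))
        covered |= gain
    return ranking


# Hypothetical results for five series labelled "a" to "e".
passed = {1: {"a", "b", "c"}, 3: {"d"}, 6: {"a", "b", "e"}}

overall = unconditional_ranking(passed)      # model 6 is a close second
conditional = conditional_ranking(passed)    # model 3 overtakes model 6
```

In this toy example model 6 passes as many series as model 1 but mostly the same ones, so in
the conditional ranking it ties with model 3, which passes a series that no other model fits.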

Table 10 shows that: 
• when only the chi-square statistic (χ²) and average forecast error (FE) are used as criteria, 
models 4 and 7, which have the most parameters, rank at the top. 

• on the other hand, the use of all criteria favours the simplest models (models 1 and 6), at 
all levels of small parameter (SP) and overdifferencing (OD) criteria. 

• models 1 and 6 usually rank close together, although model 1 has one less parameter than 
model 6. 

• when model 6 is not first it is a close second. 

• the more the criteria are relaxed, the higher the pass ratio is, although the ranking of the 
models remains about the same. 

In table 11 we see that: 
• when all criteria are used, models 1 and 6, which ranked first and second in table 10, now 
rank only first and third. 

• second place belongs to model 3. This model, which in table 10 ranked sixth, fifth, sixth and 
sixth with total success rates of 41%, 13%, 17%, and 19%, here ranks fourth once and second 
three times. This is because model 3 fits well an important family of series (series with 
a steep trend) that all other models fit poorly. 

• moving average and autoregressive models are not mutually exclusive. These two families 
of models are complementary and necessary in fitting and forecasting series. 

• when we require only that the average forecast error be less than 15% and the chi-square 
statistic be greater than 5% and nothing else, the combined success rate of models 4, 7, 2 
and 3 together is 63%. 

• when all the criteria are used, the models chosen are simple and their combined success 
rate varies between 46% and 54% using the levels of 15% and 5% described just above. 
The success rate depends on the levels of small parameter and overdifferencing used. 

Even though model 1 does not appear in the third column of table 11, it would appear 
there if the level of forecast error permitted were raised to 20%. 

The criteria and levels used in selecting models in figures 1 and 2 are the same as are used 
in the second column of tables 10 and 11, except that in figure 1 the average forecast error 
permitted varies between 10% and 99% while in figure 2 the chi-square criterion varies 
between 10% and 60%. 

Figure 1 shows that: 
• models 1, 3 and 6 perform the best. 

• the ranking of the models tends to remain the same. 

• the performance of the first model increases more rapidly than that of the others, going 
from 23% to 59% compared with an increase from 13% to 17% for model 3. This point 
needs clarification. Model 1 is chosen according to its unconditional performance, while the 
other models are chosen according to their conditional ranking. 

• the increase in performance of the models according to unconditional ranking is greater 
than the increase when using conditional ranking. 

We see in figure 2 that: 
• models 1, 3 and 6 are generally the best models for any level of chi-square. 

• models 1 and 6 trade places but are not mutually exclusive. 


Table 12 
Conditional ranking of the ARIMA models for the sectors of the 
Canadian economy 

                              Models ranking and % of series that passed

Sectors                       first                     second                    third                     fourth
                              model        %            model        %            model        %            model        %

Labour                        1            79           3            14           -            0            -            0
Prices                        5            50           [illegible]  17           [illegible]  8            -            0
Manufacturing                 [illegible]  19           6            14           1            5            [illegible]
Fuel, Power and Mining        1            46           6            4            -            0            -            0
Domestic Trade                1            53           6            [illegible]  7            [illegible]  -            0
External Trade                6            21           -            0            -            0            -            0
Transportation                1            54           5            8            -            0            -            0
Finance                       1            32           [illegible]  11           -            0            -            0

Table 12 presents the conditional ranking of the ARIMA models for those sectors of the 
Canadian economy for which we fitted twelve or more series. The criteria and levels used 
in ranking the models are the same as those used in the second column of tables 10 and 11. 
We see that: 
• models 1 and 6 are generally the best performers. 

• the combined success rate of the models varies considerably from one sector to another, 
from 93% in the labour sector to only 21% in external trade. 

• this success rate is at least 50% for five sectors. The rate depends on the structure of the 
series, changes in the structure, and the amount of irregular in the series. The rate is good 
considering that for two of the last three years Canada suffered a severe recession which 
strongly affected the structure of the series. The success rate for external trade is always low 
because those series are very irregular. 


5. WITHIN-SAMPLE AND OUT-OF-SAMPLE FORECASTS 


The within-sample forecasts are obtained by fitting the models to the entire series in order 
to estimate the parameters and calculate the forecasts for the last three years. The out-of- 
sample forecasts do not use information from after the forecast time origin. For each forecast 
origin, the parameters are re-estimated. 


Table 13 
Failure Rate in Forecast Error for 
Within-Sample and Out-of-Sample Forecasts 

               Model 1        Model 2        Model 3        Model 4        Model 5        Model 6        Model 7
               (0,1,1)(0,1,1) (0,1,2)(0,1,1) (0,2,2)(0,1,1) (2,1,2)(0,1,1) (1,1,0)(0,1,1) (2,1,0)(0,1,1) (2,1,0)(0,1,2)
                    %              %              %              %              %              %              %
Within-sample      34             35             41             32             34             34             33
Out-of-sample      31             32             42             33             31             32             31


Table 14 
Conditional and Unconditional Ranking of the Models 

Unconditional ranking            Conditional ranking
Models   % of series             Models        % of series
         that passed                           that passed
  1         40%                    1              40%
  6         28%                    2               5%
  5         27%                  [illegible]       4%
  2         20%                    3               3%
  3         14%
  7         10%
  4          2%

Table 13 shows the rate of failure in forecast error at the 15% level for within-sample 
and out-of-sample forecasts. The difference between the two is small and is well within one 
standard deviation for each model. The X-11-ARIMA seasonal adjustment program uses 
within-sample forecasts because they cost less. 

Table 14 has been prepared using the same criteria and levels as were used in the second 
columns of tables 10 and 11. The unconditional ranking is exactly the same as that in the 
second column of table 10. Only the success rates of the first three models differ, and in 
table 14, model 1 is clearly superior to the other models. However, the conditional ranking 
is different from that appearing in the second column of table 11. 

The conditional rankings in tables 11 and 14 differ for two reasons. First, of course, table 
14 uses out-of-sample forecasts. Another important reason is that the calculation of the seven 
other criteria was based on one year less data, and the missing year contained a severe reces- 
sion. Thus the structure of the series and the choice of models is markedly different. 

It appears therefore that the conditional ranking of the models for both within-sample 
and out-of-sample forecasts depends on the phase of the business or economic cycle in which 
the series ends. 


6. CONCLUSION 


Our objective was to rank a set of seven ARIMA models according to their fitting and 
forecasting of a large sample of time series. 
• When only the chi-square statistic and the average forecast error are used as criteria, models 
4 and 7 rank at the top. 
• The use of all eight criteria favours the simplest models (1 and 6) and model 3. 
• Models 1 (moving average model) and 6 (autoregressive model) rank close together in 
unconditional ranking, although model 1 has one less parameter than model 6. 
• In conditional ranking, these two both rank highly but are not mutually exclusive. That 
is, moving average and autoregressive models are complementary and both are necessary 
in fitting and forecasting series. 
• Although Model 3 ranks near the bottom, it fits well an important family of series (series 
with a steep trend) that all other models fit poorly. 
• The nonparsimonious models (numbers 4 and 7) have a combined success rate of 61% 
compared to a success rate that varies between 44% and 52% for parsimonious models 1, 6 and 3. 
• The combined success rate of the models varies considerably from one economic sector 
to another, from 93% in the labour sector to only 21% in external trade. This rate depends 
on the structure of the series, changes in the structure, and the amount of irregular in the series. 
• It appears that the conditional ranking of the models for both within-sample and 
out-of-sample forecasts depends on the phase of the business or economic cycle in which the 
series ends. 


An Empirical Study of Some Regression 
Estimators for Small Domains 


M.A. HIDIROGLOU and C.E. SARNDAL¹ 


ABSTRACT 


The synthetic estimator (SYN) has been traditionally used to estimate characteristics of small domains. 
Although it has the advantage of a small variance, it can be seriously biased in some small domains 
which depart in structure from the overall domains. Sarndal (1981) introduced the regression estimator 
(REG) in the context of domain estimation. This estimator is nearly unbiased; however, it has two 
drawbacks: (i) its variance can be considerable in some small domains and (ii) it can take on negative 
values in situations that do not allow such values. 


In this paper, we report on a compromise estimator which strikes a balance between the two estimators 
SYN and REG. This estimator, called the modified regression estimator (MRE), has the advantage of 
a considerably reduced variance compared to the REG estimator and has a smaller Mean Squared Er- 
ror than the SYN estimator in domains where the latter is badly biased. The MRE estimator eliminates 
the drawback with negative values mentioned above. These results are supported by a Monte Carlo study 
involving 500 samples. 


KEY WORDS: Small domains; regression estimation; modified regression estimator; bias; mean squared 
error. 


1. INTRODUCTION 


The synthetic estimator (SYN) has the advantage of a small variance, but the following 
disadvantages: (a) it can be badly biased in some domains, and ordinarily we do not know 
which ones; (b) consequently, a calculated coefficient of variation (cv), or a calculated con- 
fidence interval, is meaningless for such domains. 

For the same model that underlies the SYN estimator one can create a nearly unbiased 
analogue, the generalized regression estimator (REG), which has the additional advantage 
that a standard design based confidence interval is easily computed for each domain estimate. 
A disadvantage with REG is that the estimated variance (and hence the cv and the width 
of the confidence interval) can be unacceptably large in very small domains. (This is, of course, 
a direct consequence of the shortage of observations in such domains.) Also, the REG can 
(although with small probability) take negative values in situations where such values are 
unacceptable. 

It is therefore desirable to strike a balance between SYN and REG. Here, we report on 
an empirical study with one such compromise estimator, the modified regression estimator 
(MRE). It has a small (but noticeable) bias in those domains where the synthetic estimator 
is greatly biased; in other domains, the MRE is nearly unbiased. The MRE has the advantage 
of a considerably reduced variance compared to the REG estimator. In addition, the MRE 
has a smaller Mean Squared Error than the SYN estimator in domains where the latter is 
badly biased. Meaningful confidence intervals can also be easily constructed for the new MRE 
estimator. 


¹M.A. Hidiroglou, Business Survey Methods Division, Statistics Canada, 5-C8, Jean Talon Building, Tunney's 
Pasture, Ottawa, Ontario, Canada K1A 0T6 and C.E. Sarndal, Department of Mathematics and Statistics, 
University of Montréal, Montréal, Québec, Canada H3C 3J7. 

The paper is structured as follows. In Section 2, some of the commonly used estimators for 
small areas such as the direct, post-stratified and synthetic estimators are reviewed as well as some 
of the regression estimators given by Sarndal (1981, 1984). In Section 3, the proposed modified 
regression estimators are introduced and discussed. In Section 4, the properties of the modified 
regression estimators as well as some of the other estimators are studied through a Monte Carlo 
simulation using business tax data. Finally, Section 5 provides some general conclusions. 


2. ESTIMATORS 


Let the population $U = \{1, \ldots, k, \ldots, N\}$ be divided into $D$ non-overlapping domains 
$U_{1.}, \ldots, U_{d.}, \ldots, U_{D.}$. Let $N_{d.}$ be the size of $U_{d.}$. (In our empirical study, the 
domains are defined by a cross-classification of 4 industrial groupings with the 18 census 
divisions in the province of Nova Scotia. There were $D = 70$ non-empty domains, as described in 
Hidiroglou, Morry, Dagum, Rao and Sarndal (1984).) 

The population is further divided along a second dimension, into $G$ non-overlapping 
groups $U_{.1}, \ldots, U_{.g}, \ldots, U_{.G}$. 

The size of $U_{.g}$ is denoted $N_{.g}$. (In our study, the groups are based on Gross Business 
Income classes.) The cross-classification of domains and groups gives rise to $DG$ population 
cells $U_{dg}$; $d = 1, \ldots, D$; $g = 1, \ldots, G$. Let $N_{dg}$ be the size of $U_{dg}$. 

Then the population size $N$ can be expressed as 

$$N = \sum_{d=1}^{D} N_{d.} = \sum_{g=1}^{G} N_{.g} = \sum_{d=1}^{D} \sum_{g=1}^{G} N_{dg}. \qquad (2.1)$$

Let $s$ denote a sample of size $n$ drawn from $U$ by simple random sampling (srs). Denote 
by $s_{d.}$, $s_{.g}$ and $s_{dg}$ the parts of $s$ that happen to fall, respectively, in $U_{d.}$, $U_{.g}$ and $U_{dg}$. 

The corresponding sizes, which are random variables, are denoted by $n_{d.}$, $n_{.g}$ and $n_{dg}$. 
Note that (2.1) holds for lower case n's as well. The variable of interest, $y$ (= Wages and 
Salaries) takes the value $y_k$ for the k:th unit (= unincorporated business tax filer). The 
auxiliary variable $x$ (= Gross Business Income) takes the value $x_k$ for the k:th unit, and $x_k$ 
is known for all $k = 1, \ldots, N$. 

The following estimators of the domain total $t_d = \sum_{U_{d.}} y_k$ are compared, where $\sum_{U_{d.}}$ 
denotes the summation over the units in $U_{d.}$. 


The straight expansion estimator (EXP): 

$$\hat{t}_{d\mathrm{EXP}} = \frac{N}{n} \sum_{s_{d.}} y_k \qquad (2.2)$$

The poststratified estimator (POS): 

$$\hat{t}_{d\mathrm{POS}} = N_{d.} \bar{y}_{s_{d.}} \qquad (2.3)$$

where 

$$\bar{y}_{s_{d.}} = \frac{1}{n_{d.}} \sum_{s_{d.}} y_k$$

is the mean of the $n_{d.}$ y-values from the d:th domain. If $n_{d.} = 0$ we define the POS 
estimator to be zero (somewhat arbitrarily, since strictly speaking the estimator is then 
undefined). Neither the EXP nor the POS estimator is particularly advantageous. They serve 
mainly as benchmarks against which the behaviour of the following more efficient estimators will 
be compared. 

Two versions of the SYN and REG have been investigated, the "Count" version and the 
"Ratio" version. The SYN estimator is based on the assumption that a given model holds 
for each group $g$. For the "Count" version a given model would lead to the assumption 
that the mean of each group is the same across all domains $d$. For the "Ratio" version, the 
implied model would be that the ratios of a given variable of interest over an auxiliary variable 
would be constant within a given group across all domains. If the assumption of homogeneity 
of domain characteristics does not hold within each group, the SYN estimators can be very 
biased. The REG estimation method as given by Sarndal (1984) is motivated by the following 
requirements: (a) to obtain approximately design-unbiased estimates with simple variance 
estimates and easily calculable (and meaningful) confidence intervals; (b) to strengthen the 
estimates by involving sample data from all domains. 

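The formulas for the "Count" versions are: 

Synthetic-Count estimator (SYN/C): 

$$\hat{t}_{d\mathrm{SYN/C}} = \sum_{g=1}^{G} N_{dg} \bar{y}_{s_{.g}} \qquad (2.4)$$

where $\bar{y}_{s_{.g}}$ is the mean of $y$ in $s_{.g}$. 

Regression-Count estimator (REG/C): 

$$\hat{t}_{d\mathrm{REG/C}} = \sum_{g=1}^{G} \{ N_{dg} \bar{y}_{s_{.g}} + \hat{N}_{dg} (\bar{y}_{s_{dg}} - \bar{y}_{s_{.g}}) \} \qquad (2.5)$$

where $\bar{y}_{s_{dg}}$ is the mean of $y$ in $s_{dg}$, and $\hat{N}_{dg} = N n_{dg}/n$. Here, 
$\sum_{g=1}^{G} \hat{N}_{dg} (\bar{y}_{s_{dg}} - \bar{y}_{s_{.g}})$ is a bias correction term that ordinarily 
carries a considerable variance contribution. 

The "Ratio" versions of the SYN and REG estimators are: 

Synthetic-Ratio estimator (SYN/R): 

$$\hat{t}_{d\mathrm{SYN/R}} = \sum_{g=1}^{G} X_{dg} \hat{R}_g \qquad (2.6)$$

with $X_{dg} = \sum_{U_{dg}} x_k$ and $\hat{R}_g = \bar{y}_{s_{.g}} / \bar{x}_{s_{.g}}$, where $\bar{x}_{s_{.g}}$ is the mean of $x$ in $s_{.g}$. 

Regression-Ratio estimator (REG/R): 

$$\hat{t}_{d\mathrm{REG/R}} = \sum_{g=1}^{G} \{ X_{dg} \hat{R}_g + \hat{N}_{dg} (\bar{y}_{s_{dg}} - \hat{R}_g \bar{x}_{s_{dg}}) \} \qquad (2.7)$$

where $\bar{x}_{s_{dg}}$ is the mean of $x$ in $s_{dg}$. 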
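
The benchmark estimators and the "Count" versions can be sketched in code for one domain. The
sketch assumes simple random sampling and the usual forms of these estimators: EXP as N/n times
the domain sample total, POS as the domain size times the domain sample mean, SYN/C as the sum
of cell counts times group sample means, and REG/C as SYN/C plus the cell-level correction.
All function names and all data values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def exp_estimator(y_domain, N, n):
    """Straight expansion: N/n times the sample total in the domain."""
    return N / n * sum(y_domain)


def pos_estimator(y_domain, N_d):
    """Poststratified: domain size times domain sample mean (zero if no sample)."""
    return N_d * mean(y_domain) if y_domain else 0.0


def syn_count(N_dg, y_by_group):
    """Synthetic-Count: cell counts times group means taken over all domains."""
    return sum(N_dg[g] * mean(y_by_group[g]) for g in N_dg)


def reg_count(N_dg, y_by_group, y_by_cell, N, n):
    """Regression-Count: synthetic term plus a correction from the cells of the domain."""
    total = 0.0
    for g in N_dg:
        ybar_g = mean(y_by_group[g])
        total += N_dg[g] * ybar_g
        cell = y_by_cell.get(g, [])
        if cell:                                   # an empty cell adds no correction
            N_hat_dg = N * len(cell) / n           # estimated cell size
            total += N_hat_dg * (mean(cell) - ybar_g)
    return total


# Hypothetical population of N = 600 units, srs of n = 6, two groups.
N, n = 600, 6
y_by_group = {1: [10.0, 12.0, 14.0, 20.0], 2: [50.0, 70.0]}   # sample values by group
y_by_cell = {1: [20.0], 2: []}                                 # sample values falling in domain d
N_dg = {1: 30, 2: 20}                                          # known cell counts of domain d
N_d = sum(N_dg.values())
y_domain = [y for cell in y_by_cell.values() for y in cell]

t_exp = exp_estimator(y_domain, N, n)
t_pos = pos_estimator(y_domain, N_d)
t_syn = syn_count(N_dg, y_by_group)
t_reg = reg_count(N_dg, y_by_group, y_by_cell, N, n)
```

With a single sampled unit in the domain, the correction term moves the REG/C estimate well
away from the SYN/C estimate, which illustrates why the correction carries so much variance in
very small domains.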


3. MODIFIED REGRESSION ESTIMATORS 


Regression estimators introduced by Sarndal (1984) were constructed by fitting a regression 
model to some auxiliary variables and using the resulting fitted model to create predicted 
values for the units in the population domain. Assuming that the sampling design, $p$, is an 
arbitrary one (not necessarily srs) with inclusion probabilities $\pi_k$ (first order) and $\pi_{kl}$ 
(second order), let the regression model be given by 

$$E_\xi(y_k) = x_k' \beta; \quad V_\xi(y_k) = v_k$$

where the $y_k$ are independent random variables. An estimator of $\beta$ is 

$$\hat{\beta} = \Big( \sum_{s} \frac{x_k x_k'}{v_k \pi_k} \Big)^{-1} \sum_{s} \frac{x_k y_k}{v_k \pi_k}$$

where it is assumed that the $v_k$ are known up to multiplicative constant(s) that cancel when 
$\hat{\beta}$ is derived. 

Following Sarndal (1984), a nearly unbiased estimator of the unknown d-th domain total 
is given by 

$$\hat{t}_{d\mathrm{REG}} = \sum_{U_{d.}} \hat{y}_k + \sum_{s_{d.}} \frac{e_k}{\pi_k} \qquad (3.1)$$

where $\hat{y}_k = x_k' \hat{\beta}$ is the k-th predicted value and $e_k = y_k - \hat{y}_k$ denotes the k-th residual. 
We shall refer to $\sum_{U_{d.}} \hat{y}_k$ as the synthetic term of the estimator $\hat{t}_{d\mathrm{REG}}$ and the second term, 
$\sum_{s_{d.}} e_k/\pi_k$, will be called the correction term. 

If $s_{d.}$ is non-empty, an approximately unbiased alternative to the REG estimator (3.1) is 
given by 

$$\hat{t}^{*}_{d\mathrm{REG}} = \sum_{U_{d.}} \hat{y}_k + N_{d.} \frac{\sum_{s_{d.}} e_k/\pi_k}{\hat{N}_{d.}} \qquad (3.2)$$

where 

$$\hat{N}_{d.} = \sum_{s_{d.}} \frac{1}{\pi_k}$$

is the estimated domain size. 

The correction term now appears in the form of a ratio estimator, 

$$\frac{\sum_{s_{d.}} e_k/\pi_k}{\sum_{s_{d.}} 1/\pi_k},$$

multiplied by the known domain size $N_{d.}$ (obviously, $N_{d.}$ is known since the cell counts $N_{dg}$ 
are known). 

The size $n_{d.}$ being random, the ratio form will serve to reduce the variance of the 
correction term. The effect will be particularly noticeable in domains where the average of the 
residuals is clearly away from zero (that is, in domains where the model does not fit well). 

If the expected sample take in the domain, $E_d = E_p(n_{d.}) = \sum_{U_{d.}} \pi_k$, were substantial (say, 
$E_d = 50$), then it is practically certain that the realized sample take, $n_{d.}$, will not be 
exceedingly small. For example, under srs, values $n_{d.} < 30$ will hardly ever occur. In such 
situations, the nearly unbiased estimator (3.2) can be recommended as is. It should realize 
important efficiency gains over (3.1), notably in domains where the model does not fit as 
well. But in practice one often encounters domains that are so small that the expected 
sample take $E_d$ does not exceed 5. This is true for a number of domains in our study. In such 
cases, realized sample takes $n_{d.}$ between zero and five are very likely. Our empirical work 
has confirmed the intuitively obvious fact that the residual correction will, in these small 
domains, contribute greatly to the variance, whether the correction appears in its straight 
form, $\sum_{s_{d.}} e_k/\pi_k$, as in (3.1), or in its ratio form, $N_{d.} (\sum_{s_{d.}} e_k/\pi_k) / (\sum_{s_{d.}} 1/\pi_k)$, as in (3.2). 

To counteract this inflated variance contribution, we modify the correction term of (3.2) 
in a way implying that we settle for a small bias (in domains where the model fits less well) 
in exchange for a reduced variance contribution when the realized sample take $n_{d.}$ is lower 
than expected (and it is assumed that the expected sample take is already low in itself). 

The form of the new correction term will be determined by the relation between realized 
sample take $n_{d.}$ and expected sample take $E_d$. The correction term $\sum_{s_{d.}} e_k/\pi_k$ will be 
multiplied by $(\hat{N}_{d.}/N_{d.})$ when $n_{d.} < E_d$ and by $(N_{d.}/\hat{N}_{d.})$ otherwise. The resulting correction 
term using this adaptive "dampening factor" will have the effect of not "over-correcting" 
the synthetic term when some of the residuals $e_k$ behave as outliers for small $n_{d.}$'s. The 
"over-correcting" may have the effect of greatly underestimating a domain $d$, yielding negative 
values when only positive values are acceptable, or conversely greatly overestimating the 
domain. 

The resulting estimator, the modified regression estimator (MRE), incorporating these 
two types of realizations of $n_{d.}$, is 

$$\hat{t}_{d\mathrm{MRE}} = \sum_{U_{d.}} \hat{y}_k + F_d \sum_{s_{d.}} \frac{e_k}{\pi_k} \qquad (3.3)$$

where 

$$F_d = \begin{cases} N_{d.}/\hat{N}_{d.} & \text{when } n_{d.} \geq E_d \\ \hat{N}_{d.}/N_{d.} & \text{when } n_{d.} < E_d \end{cases}$$

It can be shown that (3.3) is nearly unbiased conditionally on $n_{d.}$, as long as $n_{d.} \geq E_d$. 
For $n_{d.} < E_d$, the MRE has some conditional bias, which tends to increase the more $n_{d.}$ falls 
short of its expected value. At the same time, the MRE estimator is being pushed towards 
its synthetic term, thus benefitting from the stability (low variance) of the synthetic term. 
Unconditionally, the MRE estimator given by (3.3) will have a certain small bias, but a much 
reduced variance compared with the REG estimator. 

We note a final point in favour of the MRE estimator. As a result of its considerable variance
in very small domains, the REG estimator will, with a small but positive probability, take
values extremely removed from the true value $t_{d.}$. The value of the REG may even be
negative, which is, of course, unacceptable for a variable (such as Wages and Salaries) which
is by definition non-negative. Negative values of the REG estimate can occur when there
exist large negative residuals $e_k$ in the correction term of (3.1), and are especially likely
when $n_{d.} < E_d$. The new MRE estimator virtually eliminates this occurrence of negative
estimates. In practice, if by a remote possibility the MRE takes a negative value, we recommend
redefining the MRE estimator as being equal to the always positive SYN estimator.
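
The action of the dampening factor can be sketched in a few lines of Python. This is a minimal illustration only, assuming simple random sampling, so that $\pi_k = n/N$, $\hat N_{d.} = N n_{d.}/n$ and $E_d = n N_{d.}/N$. The function names and the numbers in the usage note are hypothetical and are not taken from the study. The fallback to the synthetic term follows the recommendation made above for negative values.

```python
def dampening_factor(n_d, N_d, n, N):
    """Return F_d for one domain, assuming simple random sampling."""
    N_hat_d = N * n_d / n      # estimated domain size
    E_d = n * N_d / N          # expected sample take in the domain
    if n_d >= E_d:
        return N_d / N_hat_d   # ratio-form correction, as in (3.2)
    return N_hat_d / N_d       # dampened correction when the take is short


def mre_estimate(synthetic_total, residuals, N_d, n, N):
    """Synthetic term plus dampened residual correction.

    `residuals` holds the e_k for the sampled units falling in the domain.
    """
    n_d = len(residuals)
    correction = (N / n) * sum(residuals)   # sum of e_k / pi_k with pi_k = n / N
    estimate = synthetic_total + dampening_factor(n_d, N_d, n, N) * correction
    # Recommended fallback: a negative MRE is replaced by the SYN estimate.
    return estimate if estimate >= 0 else synthetic_total
```

With hypothetical values N = 100, n = 10 and N_d = 20 (so that the expected take is 2), a single sampled unit with residual -50 gives an undamped correction of -500, whereas the dampened correction is -250.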

A natural formula for estimating the variance of (3.2) is 


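$$\hat V(\hat t_{d.\mathrm{REG}}) = \left(\frac{N_{d.}}{\hat N_{d.}}\right)^2 \sum_{k \in s_{d.}} \sum_{l \in s_{d.}} \Delta_{kl}\, \frac{(e_k - \tilde e_{s_{d.}})(e_l - \tilde e_{s_{d.}})}{\pi_k \pi_l} \qquad (3.4)$$

where

$$\tilde e_{s_{d.}} = \frac{\sum_{s_{d.}} e_k/\pi_k}{\sum_{s_{d.}} 1/\pi_k}$$

and

$$\Delta_{kl} = \begin{cases} 1 - \pi_k & \text{if } k = l, \\ 1 - \pi_k \pi_l/\pi_{kl} & \text{if } k \ne l. \end{cases}$$
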
We propose that the same formula may serve well to estimate the variance of the MRE
estimator (3.3). It is true that (3.3) differs from (3.2) when the realized sample take falls
short of the expected sample take; however, it is not foreseen that the difference will be great
enough to cause serious distortion in the validity of a confidence interval for $t_{d.}$ centred on
$\hat t_{d.\mathrm{MRE}}$, using (3.4) as the estimated variance.


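In the case of simple random sampling, and assuming for $g = 1, \ldots, G$,

$$E_\xi(y_k) = \beta_g; \quad V_\xi(y_k) = \sigma_g^2; \quad k \in U_{.g},$$

we find

$$\hat\beta_g = \bar y_{s_{.g}},$$

leading to the “Count estimator” whose modified version (MRE/C) is

$$\hat t_{d.\mathrm{MRE/C}} = \sum_{g=1}^{G} \left\{ N_{dg}\, \bar y_{s_{.g}} + F_d\, \hat N_{dg} \left( \bar y_{s_{dg}} - \bar y_{s_{.g}} \right) \right\} \qquad (3.5)$$

where $E_d$ in the formula for $F_d$ is now given by

$$E_d = E_s(n_{d.}) = \frac{n N_{d.}}{N} \qquad (3.6)$$

with

$$\hat N_{dg} = N \left( \frac{n_{dg}}{n} \right)$$

and

$$\bar y_{s_{dg}} = \begin{cases} \sum_{s_{dg}} y_k / n_{dg} & \text{for } n_{dg} \ge 1, \\ 0 & \text{otherwise.} \end{cases}$$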

The MRE/C estimator will have some bias, which is, however, ordinarily much less than
that of the SYN/C estimator.


The underlying model assumptions which lead to the “ratio estimator”, whose modified
version is denoted as MRE/R, are for $g = 1, \ldots, G$,

$$E_\xi(y_k) = \beta_g x_k; \quad V_\xi(y_k) = \sigma_g^2 x_k; \quad k \in U_{.g}.$$

The MRE/R estimator is then, in the case of simple random sampling,

$$\hat t_{d.\mathrm{MRE/R}} = \sum_{g=1}^{G} \left\{ X_{dg} \hat R_g + F_d\, \hat N_{dg} \left( \bar y_{s_{dg}} - \hat R_g\, \bar x_{s_{dg}} \right) \right\} \qquad (3.7)$$

where

$$\hat R_g = \frac{\sum_{s_{.g}} y_k}{\sum_{s_{.g}} x_k}$$

and

$$X_{dg} = \sum_{U_{dg}} x_k.$$


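Drew, Singh and Choudhry (1982) provided small domain estimators which, although not
derived by a regression approach, have some similarity to the ones given in this paper. Their
“count” version is

$$\hat t_{d.\mathrm{DSC/C}} = \sum_{g=1}^{G} N_{dg} \left\{ W_{dg}\, \bar y_{s_{dg}} + (1 - W_{dg})\, \bar y_{s_{.g}} \right\} \qquad (3.8)$$

while their “ratio” version is

$$\hat t_{d.\mathrm{DSC/R}} = \sum_{g=1}^{G} X_{dg} \left\{ W_{dg}\, \frac{\bar y_{s_{dg}}}{\bar x_{s_{dg}}} + (1 - W_{dg})\, \frac{\bar y_{s_{.g}}}{\bar x_{s_{.g}}} \right\} \qquad (3.9)$$

where

$$W_{dg} = \begin{cases} n_{dg}/E_{dg} & \text{if } n_{dg} < E_{dg}, \\ 1 & \text{otherwise,} \end{cases}$$

with $E_{dg} = n(N_{dg}/N)$. In the present context, if $W_{dg}$ in (3.8) is replaced by

$$\begin{cases} \left( \dfrac{n_{d.}}{E_d} \right) \left( \dfrac{n_{dg}}{E_{dg}} \right) & \text{if } n_{d.} < E_d, \\[2ex] \left( \dfrac{E_d}{n_{d.}} \right) \left( \dfrac{n_{dg}}{E_{dg}} \right) & \text{if } n_{d.} \ge E_d, \end{cases}$$

we obtain $\hat t_{d.\mathrm{MRE/C}}$.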


4. RESULTS FROM THE EMPIRICAL STUDY 


In order to study the properties of the estimators discussed in the preceding sections, a 
simulation was undertaken. The province of Nova Scotia was chosen as our population with 
N = 1678 sampling units (unincorporated tax filers). The variable of interest, y, is Wages 
and Salaries. We use a single auxiliary variable, x, namely, Gross Business Income. It is assumed
that $x_1, \ldots, x_N$ are known.

Domains of the population were formed by a cross-classification of four industrial groups
with eighteen regions. The industrial groups were Retail (515 units), Construction (496 units),
Accommodation (114 units) and Others (553 units). The overall correlation coefficients between
Wages and Salaries and Gross Business Income were 0.42 for Retail, 0.64 for Construction,
0.78 for Accommodation and 0.61 for Others. The regions were the 18 Census
Divisions of the province. This produced 70 non-empty domains (out of the four times 18
domains, two combinations had no units). Thus, 70 domain totals $t_{d.}$ are to be estimated
every time a sample is drawn.

For the Monte Carlo simulation, 500 simple random samples, s, each of size n = 419,
were selected from the population of N = 1678 units. The selected sample units were classified
into type of industry and Census Division. The population could have been divided along
a second dimension, say income groups. But for the purposes of this study, all the tax filers
were considered as belonging to one income group (G = 1).

The results are summarized for each small area within the industrial groups RETAIL and
ACCOMMODATION using tables and graphs. For the tables (1-4), the summary statistics are
the relative conditional bias and the mean squared error. The eight graphs, one for each of the
eight estimators, are given in Figure 1. In each graph, there are eighteen vertical ‘distribution
bands’, one for each of the eighteen Census Divisions for the industrial group RETAIL.
The upper and lower points of each distribution band correspond, respectively, to the 90th
and 10th percentiles of the distribution of the 500 values of $(\hat t_{d.} - t_{d.})/t_{d.}$. Consequently, a
distribution band placed roughly symmetrically about the zero line indicates that the corresponding
estimator is approximately unbiased for the domain of interest; otherwise, the
estimator is biased for the domain. The shorter the band, the smaller the variance of the
estimator in the domain. The abscissa measures the mean sample take for the domain.
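
The summary statistics described above can be computed from the repeated estimates for one domain as in the following Python sketch. It is illustrative only: the function names are hypothetical, and the nearest-rank rule used for the percentiles is an assumption, since the percentile convention is not stated in the text.

```python
import math


def nearest_rank_percentile(values, p):
    """p-th percentile by the nearest-rank rule (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(estimates, true_total):
    """Relative bias, mean squared error and the 10th/90th percentile band
    of the relative errors, over repeated samples for one domain."""
    m = len(estimates)
    mean_estimate = sum(estimates) / m
    relative_errors = [(t_hat - true_total) / true_total for t_hat in estimates]
    return {
        "relative_bias": (mean_estimate - true_total) / true_total,
        "mse": sum((t_hat - true_total) ** 2 for t_hat in estimates) / m,
        "band": (nearest_rank_percentile(relative_errors, 10),
                 nearest_rank_percentile(relative_errors, 90)),
    }
```

In the study each domain would contribute 500 estimates per estimator; the band returned here corresponds to the lower and upper ends of one vertical distribution band.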

From the tables and graphs, the following conclusions emerge (conclusion C states
the main new results, whereas A and B summarize what is known from earlier work: Särndal
and Råbäck (1983); Hidiroglou et al. (1984)).


A. The SYN/C and SYN/R estimators are badly biased in some domains, namely, in
those domains where the underlying model fits poorly. However, they consistently
have an attractively low variance, compared to the other alternatives. The Mean
Squared Error of the two SYN estimators will consequently be very large in domains
with large bias (poor model fit); by contrast, the Mean Squared Error is
small in domains with little bias (good model fit).


B. The REG/C and REG/R estimators are essentially unbiased. Their variance,
although usually much lower than that of the EXP and POS estimators, is consistently
much higher than that of the SYN/C and SYN/R estimators. In the
smallest domains, none of the unbiased estimators (EXP, POS, REG/C, REG/R)
is attractive from the variance point of view; this is especially true for the REG
estimators. This problem is remedied by the two MRE modifications of the REG
estimators.


C. The two MRE estimators, MRE/C and MRE/R, are negligibly biased when the
SYN estimators happen to be nearly unbiased (e.g., RETAIL, area 17); otherwise
the MRE estimators have a certain bias, which, however, is ordinarily much less
pronounced than that of the SYN estimators (e.g., RETAIL, area 2). The MRE
estimators have considerably smaller variance and Mean Squared Error, in all
domains, than the REG estimators. This tendency is particularly pronounced in
the smaller domains. In comparison with the SYN estimators, we find that the
MRE estimators (as expected) still have a larger variance in virtually all domains.
However, the Mean Squared Error of the MRE estimators is smaller than that
of the SYN estimators in domains where the latter are badly biased. In Table 4
we see, for example, that the MRE/R estimator has a smaller Mean Squared Error
than that of the SYN/R in 9 out of 16 small areas. The obvious explanation is
that in domains where the SYN estimator is greatly biased, the (bias)$^2$ constitutes
an extremely large contribution to the Mean Squared Error of the SYN, whereas
for the MRE estimators, the (bias)$^2$ is not very important. Since we do not know
which domains create the large biases, the goal of producing reliable estimates
in all domains is on the whole better served by the MRE method of estimation.


Survey Methodology, June 1985 73 


Table 1

Mean Sample Take and Relative Bias of Each of Eight Estimators over
500 Repeated Simple Random Samples from the Entire Population
Industrial Group: RETAIL; 18 Census Divisions in Nova Scotia.

         Mean
       Sample                              Estimator
Area     Take      EXP      POS    SYN/C    MRE/C    REG/C    SYN/R    MRE/R    REG/R
   1     1.76    -0.02    -0.13     0.12     0.02    -0.03     0.30     0.09    -0.02
   2     5.45     0.00    -0.04    -0.36    -0.10    -0.02    -0.27    -0.08    -0.02
   3     3.90    -0.02     0.01    -0.08    -0.02     0.00    -0.01    -0.01     0.00
   4     3.02     0.01    -0.05 [illeg.]     0.05     0.01     0.13     0.04     0.04
   5     5.93     0.00     0.01     0.21     0.05     0.00     0.13     0.03     0.00
   6     7.63    -0.02    -0.01     0.28     0.07     0.01     0.10     0.02     0.00
   7     8.61     0.02     0.01    -0.16    -0.03     0.01    -0.18    -0.03     0.01
   8     5.64    -0.02    -0.01     0.34     0.10     0.03     0.24     0.06     0.01
   9    24.64     0.00     0.00    -0.02     0.00     0.00    -0.01     0.00     0.01
  10     8.92    -0.02    -0.02 [illeg.]     0.02    -0.01     0.09     0.00    -0.01
  11     8.35    -0.03    -0.02     0.08     0.01     0.00     0.10     0.02     0.00
  12    10.58     0.01     0.00    -0.27    -0.05     0.00    -0.18    -0.03     0.00
  13     0.48    -0.04    -0.58     0.61     0.36     0.04     1.00     0.58     0.04
  14     2.80     0.03    -0.03     0.33     0.11     0.00     0.24     0.10     0.02
  15     4.21     0.06    -0.01     0.28     0.06     0.00     0.30     0.07    -0.01
  16     2.24     0.03    -0.05     0.74     0.26     0.03     0.94     0.32     0.02
  17    23.95    -0.01    -0.01    -0.02     0.00     0.00    -0.05    -0.01     0.00
  18     0.54     0.07    -0.54     0.63     0.34    -0.06     0.67     0.35    -0.06

Table 2

Mean Squared Error of Each of Eight Estimators over 500 Repeated Simple
Random Samples from the Entire Population
Industrial Group: RETAIL; 18 Census Divisions in Nova Scotia.

                                  Estimator
Area      EXP      POS    SYN/C    MRE/C    REG/C    SYN/R    MRE/R    REG/R
   1    3,209    2,206       96      697    1,397      462      769    1,484
   2   42,598   24,623   21,782 [illeg.]   17,358   13,110   10,256   14,380
   3   10,469    6,853      357    2,592    4,212      146    2,333    3,782
   4    5,626    3,657      324      746    1,186      251    1,206    1,853
   5   14,554    9,681    2,999    5,090    7,360    1,294    3,993    5,974
   6   12,308    5,686    6,713    3,423    4,289    1,255    1,747    2,515
   7   34,865   17,988    6,912    9,387   13,451    8,161   12,019   17,239
   8   12,066    8,630    5,772    3,694    5,045    2,981    3,528    4,986
   9   72,974   40,440    5,776   24,025   29,250    5,068   21,292   25,832
  10   22,091    9,433    4,559    5,832    7,927    2,009    5,365 [illeg.]
  11   23,519 [illeg.]    1,778    6,738    9,578    2,348    7,890   11,063
  12   46,588   21,874   35,310   13,558   17,084   17,454 [illeg.]   16,514
  13      635      244      161       95      228      422      287      783
  14    3,871    2,849      692    1,254    2,141      378    1,373    2,346
  15    8,088 [illeg.]    2,249    1,892    2,806    2,601    1,985    2,937
  16    3,245 [illeg.]    3,316    1,563    2,516    5,333    1,741    2,654
  17   81,211   47,753    5,503   28,957 [illeg.]    7,681   27,457   33,136
  18    1,003      306      169      187      654      186      184      637


Table 3

Mean Sample Take and Relative Bias of Each of Eight Estimators over
500 Repeated Samples from the Entire Population
Industrial Group: ACCOMMODATION; Areas: 16 Census Divisions in Nova Scotia.

         Mean
       Sample                              Estimator
Area     Take      EXP      POS    SYN/C    MRE/C    REG/C    SYN/R    MRE/R    REG/R
   1     0.25     0.01    -0.75    -0.08    -0.06    -0.01     0.36     0.28     0.01
   2     1.37    -0.06    -0.21     0.25     0.10     0.02     0.25     0.11     0.02
   3     1.02     0.06    -0.26     0.19     0.09     0.04     0.12     0.06     0.03
   4     0.23    -0.10    -0.77    -0.33    -0.26    -0.07    -0.15    -0.13    -0.05
   5     2.04     0.03    -0.13     0.21     0.08     0.03     0.18     0.06     0.01
   6     1.49     0.04    -0.13     0.17     0.10     0.03     0.03     0.02     0.01
   7     1.53     0.01    -0.18    -0.29    -0.11    -0.01    -0.30    -0.12    -0.02
   8     1.54     0.03    -0.19    -0.42    -0.17    -0.01    -0.26    -0.11    -0.02
   9     6.83     0.01    -0.02     0.13     0.02     0.00     0.12     0.02     0.00
  10     1.26    -0.01    -0.26     0.40     0.17     0.03     0.30     0.13     0.02
  11     3.06     0.04    -0.02     0.51     0.21     0.08     0.40     0.16     0.06
  12     1.80     0.02    -0.16    -0.08    -0.05    -0.03    -0.23    -0.10    -0.03
  14     1.04     0.02    -0.33    -0.52    -0.23    -0.07    -0.32    -0.15    -0.06
  15     1.54    -0.03    -0.23    -0.21    -0.13    -0.08    -0.15    -0.11    -0.08
  17     3.08    -0.07    -0.05    -0.03    -0.01     0.00    -0.14    -0.07    -0.03
  18     0.52     0.04    -0.54     3.26     3.20     0.60     2.97     2.92     0.50

Table 4

Mean Squared Error of Each of Eight Estimators over 500 Repeated Simple
Random Samples from the Entire Population
Industrial Group: ACCOMMODATION; Areas: 16 Census Divisions in Nova Scotia.

                                  Estimator
Area      EXP      POS    SYN/C    MRE/C    REG/C    SYN/R    MRE/R    REG/R
   1    1,142      283        9 [illeg.]       25       58       44      164
   2    7,467    5,082      877      631 [illeg.]      747      455      726
   3      878      442       48      163      242       24      116      163
   4      155       43 [illeg.]        6       17        3        3        6
   5   15,200    8,392    2,091    2,270    3,230 [illeg.]    1,208    1,785
   6    5,239    3,906      253    1,038    2,193       54      396      792
   7   21,197    8,781    3,569    1,831    3,016    3,709    1,812    2,948
   8   14,071    6,738    3,608 [illeg.]    4,018    1,492      947    1,766
   9   50,606   27,867    9,980   11,413   14,344    6,575    7,779    9,991
  10    2,219      993      590      362      665 [illeg.] [illeg.]      280
  11   10,535    5,774    6,366    5,126    7,154    3,867 [illeg.]    3,673
  12   16,787   10,485      543    1,148    1,944    1,245    1,130    1,836
  14   51,471   25,644    9,669    8,221   14,155    3,972    3,189    5,077
  15   59,207   41,381    4,861   10,548   18,119    2,759    4,262    6,636
  17   29,632   25,211    1,501    3,023    4,754    1,765 [illeg.]    3,214
  18      286       99    2,062    2,112    5,623    1,607    1,646    4,561


Figure 1: Distribution band of relative error for selected estimators; the abscissa represents the mean
sample take. Industrial Group: RETAIL. Areas: 18 Census Divisions in Nova Scotia.


5. CONCLUSIONS


In summary we find that the overall performance of the MRE estimators is such that we 
suggest them as promising alternatives for future applications of small area estimation. The 
recommended confidence interval procedure based on the MRE estimators is given in Section 3.

We think that the MRE method presented here involves a simple mechanism for steering 
the estimates slightly in the direction of the stable SYN estimators, when the sample take 
is less than expected. This goal is also manifested (but attained by different means) in such 
other attempts as the empirical Bayes (Fay and Herriot, 1979) and sample-dependent (Drew, 
Singh, and Choudhry 1982) methods of estimation.


1981 Census of Agriculture
Data Processing Methodology


DAVID K. HOLLINS¹


ABSTRACT 


This paper presents an overview of the methodology used in the processing of the 1981 Census of 
Agriculture data. The edit and imputation techniques are stressed, with emphasis on the multivariate 
search algorithm. A brief evaluation of the system’s performance is given. 


KEY WORDS: Edit and imputation; Multivariable searches 


1. INTRODUCTION 


This paper presents an overview of the methodology used in the processing of the 1981 
Census of Agriculture data. There are 3 separate phases to the processing of the data: Data 
Entry, Edit, and Imputation, each of which performs a different function. First, in Data 
Entry, data on the questionnaires are keyed onto a computer data file. Then, in the Edit phase, 
computer edits are applied to the keyed data records in order to detect any inconsistent, missing,
or suspicious entries. In the final phase, Imputation, actions are taken to adjust the data
records so that they conform to the rules defined by the computer edits applied during Edit. 
The methodology involved in each of the three phases of processing is described in subsequent 
sections of this paper. A flow chart of the 1981 Census of Agriculture processing is given in 
Figure 1. 


Figure 1. Overall Process Flow (boxes recovered from the chart: Pre-Grooming; Key Entry (Data Input);
Imputation; Data Validation)


¹ D.K. Hollins, Census and Household Survey Methods Division, Statistics Canada, Tunney’s Pasture, Ottawa,
Ontario, Canada K1A 0T6.

The 1981 Census of Agriculture required that the same questionnaire be completed by each
farm operator in Canada. The questionnaire is 8 pages long and consists of 134 questions. 
Questions are asked on all aspects of farm operation, including items such as types of crops 
grown, livestock raised, equipment maintained, and types of land use. Operators are required 
to answer only those sections of the questionnaire which apply to their holding. 

As this paper is an overview, it is not possible to delve into the technical computer aspects 
of the Census of Agriculture processing. These details may be found in Shields and Yiptong 
(1981), on which this paper is based. 


2. DATA ENTRY 


In the Data Entry phase the Census of Agriculture data are transferred from the original 
questionnaires to a data file in computer memory. Data Entry comprises two stages: a
clerical pre-grooming process (Pre-Scan), and Key Entry.

After the questionnaires arrive at head office for processing, a clerical pre-grooming process 
known as Pre-Scan is performed. In this process, a clerk scans each questionnaire for response 
irregularities such as unreadable entries, ditto marks, and responses in incorrect locations. If 
valid responses can be discerned, they are recorded in the appropriate locations; if not, the
questionnaire is left unchanged.

Next, in Key Entry, the data on each questionnaire are keyed into the computer. Identifying 
information from the front page of the questionnaire is entered in a standard fixed format. 
However, since farm operators are required to answer only the sections of the questionnaire 
that apply to their holding, a large portion of the questionnaire remains blank. To reduce keying
time, a method known as “string-keying” is used to enter the remaining data. This means
that the field name is keyed, immediately followed by the data value for that field. Only fields
with existing data values are keyed; unanswered portions of the questionnaire are not. Because
of the sparseness of the data, this method results in significant savings in keying time.

The Key Entry process creates one Edit and Imputation Master File (EIMF) record for each 
of a total of approximately 320,000 questionnaires. There are 244 fields on an EIMF record, 
each identified by a name, generally 6 characters in length. The Key Entry operator is instructed 
to key “#” for any unreadable entries. If possible, a clerical correction will be performed during
Edit on records containing this symbol; otherwise, the records will be corrected during
Imputation.


3. EDIT 


The Edit phase serves two purposes. The first is to use computer edits to detect any inconsistent,
missing, or suspicious entries in the data. The second is to perform a clerical correction
on the defective records or, if that is not possible, to pass the defective records on to be
fixed during Imputation. A flow chart of the Edit process is given in Figure 2.

There are 3 components to the edit system: two computer edit cycles called Correction Cycles 
#1 and #2, and a cycle for correcting edit failures, called Correction of Rejects. Correction Cycle
#1 (CC #1) consists of those edits that detect conditions that prevent the “de-stringing” (the
conversion from string format to fixed format) of the keyed record (decode edits), and those 
edits that detect errors in the geographic and identifying information from the front page of 
the questionnaire (ID edits). Correction Cycle #2 (CC #2) consists of those edits that identify 
inconsistencies in the main body of the data (data edits). Correction of Rejects is a clerical 
process during which both CC #1 and CC #2 edit failures are corrected manually. Edit failures 
that cannot be corrected by Correction of Rejects are passed on to Imputation. 

Each of the EIMF records is processed through the edit system individually.


3.1 Correction Cycle #1 (Decode and ID Edits)


Correction Cycle #1 consists of the application and resolution of two sets of edits: the decode 
edits and the ID edits. 

The decode edits are applied first and if conditions exist that prevent the “de-stringing” of 
the data record, then decode edit failures will result. For example, as no two fields should have 
the same identifying characters, “de-stringing” will be prevented if two field names are keyed 
identically. 
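
The “de-stringing” step and the duplicate-field decode edit can be illustrated with a short Python sketch. The keyed format shown here (a field name and value joined by “=”, with pairs separated by “;”) and the field names are purely hypothetical; the source states only that each field name is immediately followed by its data value.

```python
def destring(keyed_record, pair_sep=";", value_sep="="):
    """Convert a string-keyed record into a field -> value mapping.

    The separators are assumptions made for illustration. A repeated
    field name raises ValueError, standing in for a decode edit failure.
    """
    fields = {}
    for pair in keyed_record.split(pair_sep):
        if not pair:
            continue  # tolerate a trailing separator
        name, _, value = pair.partition(value_sep)
        if name in fields:
            raise ValueError(f"decode edit failure: field {name!r} keyed twice")
        fields[name] = value
    return fields
```

Under these assumptions, `destring("CATTOT=120;TRACTR=2")` yields a two-field record, while a record in which `CATTOT` is keyed twice is rejected and would be referred to the Correction of Rejects staff.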

Any failed decode edits are resolved manually by the Correction of Rejects staff. This involves
returning to the questionnaire to determine the cause of the edit failure, then rekeying
the relevant data. After an attempt is made to resolve a decode edit failure, the EIMF
record is re-edited by passing it through the decode edits again, forming a continuous cycle
between the decode edits and the Correction of Rejects staff. This cycle is repeated until there
are no decode edit failures remaining on the EIMF record. If a decode edit cannot be resolved
directly, the most appropriate valid interpretation of the available data is employed as
a final override.

After all decode edit failures have been resolved, the ID edits are applied. If any of the 
identifying information on the EIMF record is inconsistent or missing, then one or more 
ID edits will fail. These ID edit failures are resolved in an identical manner to the decode edits. 

Once all of the CC #1 (decode and ID) edit failures have been resolved by the Correction 
of Rejects staff, the EIMF record is passed through the CC #2 edit program. 


3.2 Correction Cycle #2 (Data Edits) 


The data edits (CC #2) are used to detect errors in the main body of the questionnaire, 
as opposed to errors in coding, or in identifying information. There are two types of data 
edits: non-mandatory edits (75), and mandatory edits (24). 

Non-mandatory edits are written to detect suspicious entries on the EIMF data records. 
Generally, non-mandatory edits, detecting variable values falling outside prescribed limits, 
are performed by comparing different fields or groups of fields on the questionnaire to determine
if some data values are abnormally high or low in comparison with others. For example,
a record with total farm area equalling 10 acres and containing 10,000 cattle would be
flagged by a non-mandatory limit edit. 

Mandatory edits are written to detect logical impossibilities on the data record, e.g., if 
the total number of cattle reported is not equal to the sum of the reported values for each 
of the different cattle types, then a mandatory edit would fail. The most complex mandatory 
edits are those written for the crop section of the questionnaire. 
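
The two kinds of data edit can be sketched as simple predicates in Python. The field names and the limit of 10 cattle per acre are hypothetical values chosen only to reproduce the examples in the text; the actual edit specifications are not given here.

```python
def mandatory_cattle_total_edit(record, type_fields):
    """Mandatory edit: pass only if total cattle equals the sum over cattle types."""
    return record["total_cattle"] == sum(record[field] for field in type_fields)


def nonmandatory_cattle_density_edit(record, max_cattle_per_acre=10.0):
    """Non-mandatory limit edit: pass unless cattle per acre exceeds an assumed limit."""
    return record["total_cattle"] <= max_cattle_per_acre * record["farm_area_acres"]
```

A record reporting 10,000 cattle on 10 acres passes the mandatory edit if its parts add up, but fails the limit edit and would be flagged as suspicious.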

To resolve a non-mandatory edit failure, the record is sent to a Correction of Rejects clerk.
The Correction of Rejects clerk first notes whether or not the edit failure is due to a keying
error. If it is, the relevant data are rekeyed. If it is not, the clerk scans the questionnaire to
see if the respondent has written any comments on the questionnaire that may explain the
reason for the edit failure. For example, if the respondent is instructed to answer a question
in tons, and tons has been crossed out and pounds written in, the response will probably
fail a non-mandatory limit edit. In this case, the Correction of Rejects clerk will convert
the response from pounds into tons. If the Correction of Rejects clerk can find no explanation
for the edit failure, the respondent’s answers are left intact on the EIMF record and
are marked as acceptable. Although no changes are made to the data on the EIMF record,
this is known as “force-fitting” the data.

Mandatory edit failures are handled somewhat differently from non-mandatory edit failures.
To resolve a mandatory edit failure, the failed record is sent to a Correction of Rejects clerk
who proceeds at first in an identical manner to that used in the resolution of non-mandatory
edit failures. However, if no explanation for the edit failure can be found, instead of
“force-fitting” the edit failure, the record is flagged for computer imputation.

As in CC #1, there is a continuous cycle between the Correction of Rejects staff and the
CC #2 edit program. After each attempt is made to resolve a CC #2 edit failure, the EIMF
record is re-run through the CC #2 edit program. Unlike CC #1, however, the Correction
of Rejects clerk has only 3 attempts to resolve the CC #2 edit failures on a given EIMF record.
After the third attempt, the CC #2 edit program is run once again. Any remaining
non-mandatory edit failures are marked “force fit” and any remaining mandatory edit failures
are marked “impute”. The mandatory edit failures are simply flagged at this stage. The particular
fields requiring imputation are identified at the imputation stage.
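
The three-attempt cycle can be summarized by the following Python sketch. It is a simplification under stated assumptions: each edit is represented as a hypothetical (name, is_mandatory, check) triple, and the clerk’s work is abstracted into a caller-supplied `clerk_fix` function. Neither representation comes from the source.

```python
MAX_ATTEMPTS = 3  # the clerk has only 3 attempts per EIMF record


def run_cc2(record, edits, clerk_fix):
    """Cycle a record between the CC #2 edits and a clerical correction step.

    edits: list of (name, is_mandatory, check) with check(record) -> bool.
    clerk_fix: function (record, failed_edit_names) -> corrected record.
    Returns the record and the set of flags left after the final edit run.
    """
    def failures(rec):
        return [(name, mandatory) for name, mandatory, check in edits
                if not check(rec)]

    for _ in range(MAX_ATTEMPTS):
        failed = failures(record)
        if not failed:
            return record, set()
        record = clerk_fix(record, [name for name, _ in failed])

    # The edit program is run once more after the third attempt.
    flags = set()
    for _, mandatory in failures(record):
        flags.add("impute" if mandatory else "force fit")
    return record, flags
```

A record that still fails a mandatory edit after three attempts leaves this cycle carrying the “impute” flag; the fields to be imputed are determined later.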
Figure 3. Imputation Process Flow


4. IMPUTATION 


The purpose of the 1981 Census of Agriculture imputation system (see Figure 3) is to resolve
edit failures on the EIMF data records. As all non-mandatory edit failures are “force-fit”
as described in the previous section, only the mandatory edit failures remain to be resolved
by the imputation system. In order to make the EIMF data records conform to the mandatory
edits, specified “imputation actions” are performed. These imputation actions (IA’s),
of which there are over 100, are designed so that as few fields as possible are changed on
the EIMF record, e.g., totals are always adjusted to equal the sum of the parts, rather than
the parts being adjusted to add up to the total. Each IA has associated with it the appropriate
imputation processing control information and is selected based on the field or fields requiring
imputation. There are two different types of IA’s performed: internal IA’s, or deterministic
corrections, and donor IA’s.


4.1 Internal Imputation Actions


Internal IA’s are performed in cases where sufficient data exist on the failed record to enable
the imputation system to provide a deterministic correction for the inconsistent field(s). These
internal IA’s are performed in cases where the inconsistent field(s) is (are) deterministically dependent
on other fields not requiring imputation. For example, an internal IA would be performed
if a respondent reports quantities for the various types of cattle but neglects to report the total
number of cattle. In this case, total cattle would be calculated using the sum of the quantities
reported for the various types of cattle. Another situation in which an internal IA would be
performed is where a respondent reports a certain quantity of a particular type of fruit tree
but neglects to give the corresponding acreage. In this case, the acreage would be computed
using a predetermined average density for that type of fruit tree. Internal IA’s are performed
in accordance with constraints to ensure that the imputed values are within reasonable bounds.

The implementation of internal IA’s is more straightforward than that of donor IA’s. As
the internal IA is performed using data from the same record, there is no need to specify an
algorithm for donor selection. The only requirement is to perform the deterministic correction
specified by the appropriate internal IA. All internal IA’s are performed before proceeding to
donor imputation.
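
The two internal IA examples described in this section might look as follows in Python. This is a sketch under assumptions: the field names, the density figure and the bounds are hypothetical, and clipping to a range is only one possible reading of the “constraints” mentioned above.

```python
def impute_total_cattle(record, type_fields):
    """Internal IA: set a missing total to the sum of the reported parts."""
    if record.get("total_cattle") is None:
        total = sum(record[field] for field in type_fields)
        record = {**record, "total_cattle": total}
    return record


def impute_orchard_acres(trees, trees_per_acre, min_acres, max_acres):
    """Internal IA: acreage from a tree count and a predetermined average
    density, clipped to assumed reasonable bounds."""
    acres = trees / trees_per_acre
    return min(max(acres, min_acres), max_acres)
```

Both corrections use only data already present on the failed record, which is why no donor search is needed.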


4.2 Donor Imputation Actions 


When the inconsistent field or fields are not deterministically dependent on other consistent
fields, internal IA’s cannot be applied. The lack of sufficient information on the failed record
to provide a deterministic correction to the inconsistent field(s) necessitates an imputation method
using data contained on another record. This method, known as donor imputation, involves
the transfer of data from a “clean” donor record (one which has passed all mandatory edits)
to the failed record. The transferred data will restore consistency to the inconsistent field(s)
on the failed record. For example, a donor IA will be performed in order to estimate the distribution
for types of cattle when only the total number of cattle is reported. In this case, the distribution
of cattle types present on the donor record is transferred to the failed (recipient) record.

As donor imputation requires an algorithm for locating a donor record, it is more complex 
to implement than internal imputation. In order to perform donor imputation, several search 
“parameters” must be specified. 

To ensure that a “clean” donor record is geographically close to the “bad” recipient record, 
the country is divided into distinct geographical regions called imputation regions. The delineation
of these imputation regions is based on the existing “crop district” boundaries which are
defined according to characteristics such as soil type and climate. There are 59 crop districts, 
and thus 59 imputation regions, in Canada with an average of 5,500 farms per region. In order 
to be an eligible donor, a record must be in the same imputation region as the recipient record. 

In order to avoid searching records that cannot donate suitable data, each donor IA also 
specifies the subpopulation on which the donor search is to take place. For example, if the 
distribution for types of cattle is being imputed, then the only records searched in order to 
find a donor would be members of the subpopulation where cattle have been reported. A given 
record may be a member of several of the 30 different subpopulations. In some cases, all clean 
records within the imputation region are deemed suitable donors, in which case the general
population in the imputation region is defined as the appropriate subpopulation. 

The final constraint on the file of eligible donors is the fact that records requiring any donor 
imputation themselves cannot be used as donors. However, records requiring only internal
imputation may be used as donors.

In summary, the file of eligible donors consists of all records not requiring donor imputation
that are members of the subpopulation specified by the imputation action to be performed
and that are also located in the same imputation region as the bad record.

As some records require more than one IA to be performed, there is need for a hierarchical
system of imputation action execution. To specify the order in which the IA’s are to be performed,
every IA, both internal and donor, has one of three “orders” associated with it. IA’s
of order 1 are performed first, followed by IA’s of orders 2 and 3 respectively.

To aid in the selection of a suitable donor record, one or more variables not requiring 
imputation are selected to be used as matching variables for each donor IA. These matching 
variables, selected by subject matter experts, are considered to be highly correlated with the 
field(s) requiring imputation. Both the recipient and the selected donor record should have similar 
matching variable values. As the use of continuous matching variables does not permit exact 
matches, a distance function based on the selected matching variable(s) is used to identify the 
closest eligible donor to the bad record. 
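
The donor eligibility rules and the closest-donor selection can be sketched in Python as follows. The record layout is hypothetical, and the distance used here (a sum of absolute differences over the matching variables) is an assumption made for illustration, since the distance function is not specified at this point in the text.

```python
def eligible_donors(records, region, subpopulation):
    """Records in the same imputation region and subpopulation that do not
    themselves require donor imputation."""
    return [r for r in records
            if r["region"] == region
            and subpopulation in r["subpopulations"]
            and not r["needs_donor_imputation"]]


def closest_donor(recipient, donors, matching_vars):
    """Eligible donor with the smallest distance to the recipient."""
    def distance(donor):
        # Assumed distance: sum of absolute differences.
        return sum(abs(donor[v] - recipient[v]) for v in matching_vars)
    return min(donors, key=distance)
```

Records needing only internal imputation remain eligible in this sketch, because only the `needs_donor_imputation` indicator excludes a record.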

Each donor IA has one of three possible search types associated with it. Partition searches 
(type 1) are performed when only 1 discrete matching variable is specified for the IA. Binary 
searches (type 2) are performed when only 1 continuous matching variable is specified for the 
IA. Multivariable searches (type 3) are performed when 2 or more continuous matching variables 
are specified for the IA. Each of these three search types is described individually in the following 
sections. Other combinations of matching variable types are not employed. 

Finally, after a suitable donor has been selected and if specified in the IA control informa- 
tion, the donated data from the donor record are prorated before transferring them to the reci- 
pient record. For example, if the variable “number of trucks” is used as a matching variable 
for imputing “value of trucks”, then the value of “value of trucks” assigned to the recipient 
record is equal to “value of trucks” of the donor, multiplied by the ratio “number of trucks” 
of the recipient divided by “number of trucks” of the donor. 
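
The proration rule above can be sketched in a few lines. This is only an illustration: the function name, argument names and the truck figures are hypothetical, and the guard against a zero donor value is an added assumption not mentioned in the text.

```python
def prorate(donor_value, donor_match, recipient_match):
    """Scale a donated value by the recipient/donor ratio of the matching variable."""
    if donor_match == 0:
        # Assumption: a donor with a zero matching value cannot be prorated.
        raise ValueError("donor matching value must be non-zero")
    return donor_value * (recipient_match / donor_match)


# Hypothetical example: the donor reports 4 trucks worth 60,000 in total,
# and the recipient reports 2 trucks but no value.
imputed_value = prorate(donor_value=60000.0, donor_match=4, recipient_match=2)
```

Here the recipient would receive half of the donor's "value of trucks", since it reports half as many trucks.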

As previously described, each donor imputation action has one of three search types associated 
with it. Two of these search types, binary and partition searches, are used to perform imputa- 
tion actions for which only 1 matching variable is specified. The other search type, the multi- 
variable search, is performed when 2 or more continuous matching variables are to be used. 


4.2.1 Type 1 — Partition Searches 


Partition Searches are performed when only 1 discrete matching variable with a small number 
of possible values is specified for the imputation action, e.g., as in the case where a respondent 
reports the total number of tractors, but neglects to give the corresponding total dollar value. 
Since a farmer is unlikely to have more than 3 tractors, the donor population is divided into
3 partitions: 1, 2, or 3+ tractors. A donor is chosen at random from the partition to which 
the recipient record belongs. If there are no donor records within the partition to which the 
recipient record belongs, but there are donors in any of the subsequent (higher numbered) par- 
titions, then all of the subsequent partitions are collapsed into one and a donor record is selected 
at random from this collapsed partition. If there are no donor records in the partition to which 
the recipient record belongs or in any subsequent partition, then a donor record is selected 
at random from the closest preceding (lower numbered) partition that contains any donor records. 
As these collapsing procedures are not frequently applied, no serious introduction of bias is 
encountered. If the donor population is empty, then the field to be imputed is assigned the 
maximum value allowable by the edits and the record flagged to indicate that imputation was 
unsuccessful. These flagged records are then reviewed by subject matter personnel who manually 
assign an appropriate value to the field requiring imputation. 
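
The partition search and its collapsing rules may be easier to follow as a short sketch. The data layout (a list of donor lists indexed by partition) and the function name are assumptions made for illustration; returning None stands in for the "flag for manual review" step.

```python
import random


def partition_search(recipient_partition, donors_by_partition, rng=random):
    """Select a donor following the collapsing rules described above.

    donors_by_partition is a list of donor lists, ordered by partition number
    (index 0 is the lowest partition). Returns None if there are no donors at all.
    """
    own = donors_by_partition[recipient_partition]
    if own:
        return rng.choice(own)
    # Collapse all subsequent (higher-numbered) partitions into one pool.
    later = [d for part in donors_by_partition[recipient_partition + 1:] for d in part]
    if later:
        return rng.choice(later)
    # Otherwise use the closest preceding partition that contains donors.
    for p in range(recipient_partition - 1, -1, -1):
        if donors_by_partition[p]:
            return rng.choice(donors_by_partition[p])
    return None  # empty donor population: the record would be flagged
```

With three partitions (1, 2, 3+ tractors), a recipient in the empty middle partition would draw from the 3+ partition, and only fall back to the first partition if the 3+ partition were also empty.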


4.2.2 Type 2 — Binary Searches 


Binary searches are performed when only 1 continuous matching variable is specified for 
the imputation action, e.g., as in the case where a respondent reports the total value of his/her
tractors, but does not give the corresponding number of machines. The entire file of eligible
donor records is searched and the record that minimizes the difference between the matching
variable values is selected as the donor. If two or more potential donor records are equally 
close, then the one that is geographically closer to the recipient (as judged from the geographic 
ID) is automatically selected as the donor. If the donor population is empty, then the recipient 
record is flagged to indicate that imputation was unsuccessful. 
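
A minimal sketch of the nearest-value search follows. It assumes the donor file is held sorted on the matching variable (which the name "binary search" suggests but the text does not state), and it uses the absolute difference of numeric geographic IDs as a stand-in for geographic closeness; both are illustrative assumptions, as are all names.

```python
import bisect


def nearest_donor(recipient_value, recipient_geo, donors):
    """donors: list of (match_value, geo_id, record) tuples sorted by match_value.

    Returns the record of the closest donor, or None if there are no donors.
    """
    if not donors:
        return None  # the recipient would be flagged as unsuccessful
    values = [d[0] for d in donors]
    i = bisect.bisect_left(values, recipient_value)
    # Only the neighbours on either side of the insertion point can be closest.
    neighbours = [j for j in (i - 1, i) if 0 <= j < len(donors)]
    best = min(abs(values[j] - recipient_value) for j in neighbours)
    tied_values = sorted({values[j] for j in neighbours
                          if abs(values[j] - recipient_value) == best})
    # Gather every donor sharing one of the tied values.
    tied = []
    for v in tied_values:
        tied.extend(donors[bisect.bisect_left(values, v):bisect.bisect_right(values, v)])
    # Break ties by closeness of geographic ID (assumed numeric here).
    return min(tied, key=lambda d: abs(d[1] - recipient_geo))[2]
```

The tie-break only matters when two or more donors are equally close on the matching variable; otherwise the single nearest donor is returned.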


4.2.3. Type 3 — Multivariable Searches 


Multivariable searches are performed when more than one continuous matching variable
is specified for the imputation action. These are the most complex of the three search types
performed by the 1981 Census of Agriculture. The method used to perform multivariable
searches was adapted for use at Statistics Canada by G. Sande. 

When the missing data are related to more than one continuous matching variable, it is 
desirable to use as a donor a record that is closest to the recipient record on all these matching 
variables simultaneously. This requires a multivariable search on a large donor file and has 
been made practical by grouping the donor population in such a way that it is not necessary 
to search every donor to determine the closest. This specialized grouping of records is called 
the K-D (Key Discriminator) tree. The same K-D tree may be used for all records requiring 
a certain donor IA within a particular imputation region as the file of eligible donors will 
remain the same in each case. However, if a different donor IA is to be performed using a 
different donor population, or even the same donor IA on a different imputation region, a 
new K-D tree must be built as the file of eligible donors will not contain the same records. 


a) Building the K-D Tree 


The first step in the building of the K-D tree is to perform a transformation on all of the 
matching variables by subtracting the mean and dividing by the standard deviation of the donor 
population. This allows matching variables of different scales to be specified for the same search. 
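
The transformation can be sketched as follows. The use of the population standard deviation and the guard for a constant variable are assumptions; the text says only that the mean and standard deviation of the donor population are used.

```python
from statistics import mean, pstdev


def standardize(donors):
    """donors: list of equal-length tuples of matching variable values.

    Returns (transformed_records, means, standard_deviations).
    """
    columns = list(zip(*donors))
    means = [mean(c) for c in columns]
    # Assumption: a variable with zero spread is left unscaled to avoid division by zero.
    sds = [pstdev(c) or 1.0 for c in columns]
    transformed = [
        tuple((x - m) / s for x, m, s in zip(record, means, sds))
        for record in donors
    ]
    return transformed, means, sds


# Hypothetical donors: (number of tractors, value of tractors)
z, means, sds = standardize([(1.0, 100.0), (3.0, 300.0)])
```

A recipient record would be transformed with the same donor means and standard deviations before the tree is searched, so that both variables contribute on a comparable scale.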

After the variable transformation, the following algorithm is then used to actually build 
the K-D tree. It is first applied to the entire file of eligible donors, and then to all subfiles subse- 
quently created by the algorithm. 

Firstly, the range (largest value minus smallest value) is calculated for each of the matching 
variables specified. The median value of the variable with the largest range (or the variable 
with the smallest ID if there are 2 or more with the maximum range) is then calculated. The 
variable for which the median is calculated is called the discriminator variable. This median 
value is used to split the file into 2 new subfiles, the left subfile containing records with values 
less than or equal to the median value of the discriminator variable, and the right subfile con- 
taining records with values greater than the median value of the discriminator variable. The 
algorithm is then progressively re-applied to the resulting subfiles using all specified matching 
variables until all files become TERMINAL, at which point the building of the K-D tree is 
complete. A subfile becomes TERMINAL when either the range equals zero for all matching
variables, i.e., all records in the subfile are identical, or the subfile contains 16 or fewer records.

The above algorithm will yield a K-D tree of the form illustrated in Figure 4. 

Every record contained in the original file will be present in one and only one of the subfiles 
corresponding to the terminal nodes. 
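
The building algorithm can be sketched recursively as below. The dictionary layout and names are illustrative assumptions. One safeguard has been added that the text does not describe: when the median equals the largest value, the "less than or equal" split would leave the right subfile empty, so the sketch then splits on "strictly less than" instead.

```python
from statistics import median

LEAF_SIZE = 16  # a subfile with this many records or fewer is TERMINAL


def build_kd_tree(records):
    """records: non-empty list of tuples of transformed matching variables."""
    k = len(records[0])
    ranges = [max(r[j] for r in records) - min(r[j] for r in records) for j in range(k)]
    if len(records) <= LEAF_SIZE or max(ranges) == 0:
        return {"leaf": True, "records": records}
    # index() returns the first maximum, i.e. the smallest variable ID on ties.
    disc = ranges.index(max(ranges))
    cut = median(r[disc] for r in records)
    left = [r for r in records if r[disc] <= cut]
    right = [r for r in records if r[disc] > cut]
    if not right:  # added safeguard, not part of the described algorithm
        left = [r for r in records if r[disc] < cut]
        right = [r for r in records if r[disc] >= cut]
    return {"leaf": False, "disc": disc, "cut": cut,
            "left": build_kd_tree(left), "right": build_kd_tree(right)}


def leaves(node):
    """List the record subfiles held at the terminal nodes, left to right."""
    if node["leaf"]:
        return [node["records"]]
    return leaves(node["left"]) + leaves(node["right"])


# Hypothetical donor file of 40 records with two matching variables.
records = [(float(i), float(i % 5)) for i in range(40)]
tree = build_kd_tree(records)
terminal = leaves(tree)
```

In this small example the file is split twice on the first variable, giving four terminal subfiles of ten records each, and every record appears in exactly one of them.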


b) Searching the K-D Tree 


In order to locate the best possible donor, it is necessary to decide which of the terminal 
nodes “corresponds” to the recipient record. This is done by traversing the K-D tree, using the 
transformed matching variable values of the recipient record, starting with the root node and 
proceeding until one of the terminal nodes is reached. At each node of the tree it is determin- 
ed, using the discriminator variable for that node, which of the two lower nodes the recipient 
record corresponds to. The K-D tree is traversed in this manner until a terminal node is reached. 

Figure 4. General Form of K-D Tree (node labels: Root Node, Non-Terminal Node, Terminal Node)


In order to determine which donor in the chosen terminal node is closest to the recipient 
record, a distance function is required. Because of its ease of implementation, the distance 
defined by the maximum of the absolute differences between matching variables was used. 
The selected donor record is the one that minimizes this “distance”.
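
As a small illustration, the maximum-absolute-difference distance and the choice of the closest donor within one terminal node can be written as follows. The function names and the sample values are hypothetical.

```python
def max_abs_distance(a, b):
    """Largest absolute difference across the (transformed) matching variables."""
    return max(abs(x - y) for x, y in zip(a, b))


def closest_in_node(recipient, node_records):
    """Return the donor in a terminal node that minimizes the distance."""
    return min(node_records, key=lambda d: max_abs_distance(recipient, d))


# Hypothetical terminal node with three donors and a recipient at the origin.
chosen = closest_in_node((0.0, 0.0), [(1.0, 5.0), (2.0, 2.0), (4.0, 0.5)])
```

The donor (2.0, 2.0) is chosen because its worst coordinate differs from the recipient by 2, against 5 and 4 for the other two donors.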

Although the selected donor record is the closest to the recipient record contained in the
chosen terminal node, it is possible that there are closer donor records residing in other
terminal nodes. This may occur only if a nodal boundary exists that is closer to the recipient
record than the currently selected donor record. This case is shown in Figure 5 for a donor
IA involving two matching variables, X and Y. Each quadrant represents a terminal node.

Figure 5. Closer Donors From Other Terminal Nodes (two matching variables). Legend: R - recipient record; S - selected donor; P - possible donor; N - non-selected donors; d - distance. Nodal boundaries: X = x and Y = y.

It is evident that the possible donor P is closer to the recipient R than the selected donor 
S. This is possible because R is closer to the position of the nodal boundary Y = y than to 
S, and only donor records lying in the same terminal node as the recipient record may be selected. 

A procedure, based on the variable values used to define the nodal boundaries and known 
as the bounds-overlap-ball (B.O.B.) test, is used to determine which of the other terminal nodes, 
if any, may contain donors closer to the recipient record than the selected donor record. Only 
terminal nodes that have the potential to provide closer donors are tested, and if a closer donor 
is found, then it replaces the previously selected donor. The B.O.B. test is applied until all nodes 
that may contain closer donors have been tested. 
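
The search and the boundary check can be sketched together. This is a simplified stand-in for the B.O.B. test, whose exact form is not given in the text: a neighbouring branch is visited only when its nodal boundary is nearer to the recipient than the best donor found so far. The tree layout, names and coordinates are hypothetical and are arranged to mimic the situation of Figure 5.

```python
def max_abs_distance(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def search(node, r, best=None, best_d=float("inf")):
    """Return (closest donor, distance) for recipient r."""
    if node["leaf"]:
        for d in node["records"]:
            dist = max_abs_distance(r, d)
            if dist < best_d:
                best, best_d = d, dist
        return best, best_d
    j, cut = node["disc"], node["cut"]
    if r[j] <= cut:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best, best_d = search(near, r, best, best_d)
    # The far side can hold a closer donor only if the boundary itself
    # is closer to the recipient than the current best donor.
    if abs(r[j] - cut) < best_d:
        best, best_d = search(far, r, best, best_d)
    return best, best_d


def leaf(*recs):
    return {"leaf": True, "records": list(recs)}


# Four quadrants split at X = 0 (variable 0) and then Y = 0 (variable 1).
tree = {"leaf": False, "disc": 0, "cut": 0.0,
        "left": {"leaf": False, "disc": 1, "cut": 0.0,
                 "left": leaf((-1.5, -2.0)), "right": leaf((-2.0, 1.0))},
        "right": {"leaf": False, "disc": 1, "cut": 0.0,
                  "left": leaf((1.2, -0.1)),    # P, just across the Y boundary
                  "right": leaf((2.5, 2.0))}}   # S, in the recipient's own node
R = (1.0, 0.2)
donor, dist = search(tree, R)
```

The recipient's own terminal node offers only S at distance 1.8, but the Y boundary lies 0.2 away, so the adjacent node is examined and P replaces S. The two left-hand quadrants are never visited, because the X boundary is farther away than P.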

Finally, for all three search types, after the eventual donor record has been selected, the 
donated data values are prorated as previously described, if specified in the IA control 
information. 

It will always be possible to select a donor unless the donor population is empty. If this 
occurs then the imputation region is collapsed with another and imputation is redone. It was 
never necessary to perform this operation in 1981. 


5. CONCLUDING NOTE 


A detailed evaluation, Grenier (1983), indicated that a major portion of the edit system was 
of little data quality benefit. This was because the Correction of Rejects procedures were unable 
to correct a sufficient proportion of the edit failures. For example, Correction of Rejects was
unable to correct the failures resulting from a subset of 77 of the 97 edits more than 5% of
the time. Also, many of the edits affected less than 0.1% of the population. Additionally, the
Correction of Rejects procedures were highly labour intensive and created a heavy paper burden. 
To eliminate these inefficiencies a new computer edit system will be designed for 1986. 

Statistics from the 1981 Census of Agriculture, Grenier (1983), indicated that 43% of the 
farms in Canada had at least one field imputed. Of this 43%: 


18% required internal imputation only, 
17% required donor imputation only, and 
8% required both internal and donor imputation. 


An analysis of the data distributions before and after imputation indicated that the imputa- 
tion system did not have a serious impact at the Canada level although many of the 137,390 
records imputed underwent a significant change. The system successfully handled all necessary 
imputations with only 58 records requiring manual imputation. The system was found to be 
very efficient, a processing cost of only $15,000 being incurred. Diagnostic data indicated that 
minor modifications to the system must be made for greenhouses, mushroom houses, com- 
munity pastures, and institutions, if they are to remain in the census. Due to its successful fulfill- 
ment of the requirements, it is planned to reuse the present imputation system in 1986. 


The Relationship between Statisticians and Statisticians¹


MARTIN B. WILK²


I appreciate the honour of the invitation as after-dinner speaker at this 1985 annual meet- 
ing of the Statistical Society of Canada. 

The honour is unfortunately accompanied by a responsibility, to say something worthwhile.
That is not an easy task. I thought I would approach that job in stages. So first I
invented a title. Then I thought I would try to figure out what the title meant. And that 
was to be my speech. Regrettably, I am still unsure what the title means. But I won’t let 
that deter me. Of course, as Yogi Berra said, “If you don’t know where you’re going you
may not get there”.

There are many people called statisticians who carry out a very diverse set of activities 
which are labelled statistics. In fact, at various times in my unplanned career, I have been 
various kinds of statisticians. That fact of language poses the question: What are the rela- 
tionships among these various kinds of statisticians and statistics? 

Specifically let me identify two types of statistical activity, namely probability statistics 
on the one hand and the work of statistical information development, carried out by statisti- 
cal agencies, on the other hand. What do I mean by probabilistic statistics? Without any 
attempt to be precise, I mean to encompass the discipline commonly covered in standard 
texts and lectures including notions of analyses of variance, tests of goodness of fit, design 
of experiments, variance components, Bayesian estimation and so forth. 

The results of the work of statistical agencies, like Statistics Canada and the Manitoba 
Bureau of Statistics and the U.S. Bureau of the Census, you read in the newspapers every day. 

These two kinds of work are perceived as related, and I believe are related. You might 
say the relationship has both a real and an imaginary part - and I am not at all clear what 
aspects fall into which category. 

Let us take a look at some of the manifestations of these two categories - which one might
also label as white collar statistics and blue collar statistics (which terms are used purely to
avoid laborious repetition of awkward phrases like “probability statisticians”).

The Statistical Society of Canada seems to be predominantly an organization of white
collar statisticians. A recent study indicated


66% academic membership 
21% government agencies. 


The Statistical Society of Canada lists 32 persons from Statistics Canada as members, 
out of 2,000 professionals. 


¹ Invited address at the annual meeting of the Statistical Society of Canada, Winnipeg, Manitoba, June 1985.

² Martin B. Wilk, formerly Chief Statistician of Canada, currently Senior Advisor to the Privy Council and President
of the Statistical Society of Canada.
Registration at this meeting likely consists mainly of white collar statisticians - interested
primarily in the arena of probability statistics. Not only are there only a very few persons 
(8) from Statistics Canada, I must also report that there was only minimal interest of super- 
visors at Statistics Canada in sending persons to the meeting. 

Let us look at examples of output from these two categories. The official journal of the 
Statistical Society of Canada is the Canadian Journal of Statistics. It is a quarterly. The official
release announcement vehicle of Statistics Canada is the Daily, which appeared 256
times last year. 

A comparison of titles of publications is fascinating. For the Canadian Journal of Statistics, 
I selected at random fifteen key words from 122 which represented the articles published 
in 1983. 

Here is a sample list of what white collar statisticians are writing and reading about: 

Abundance distributions
Asymptotic properties
Central Wishart distribution
Chi-squared distribution
Critical values
Decision theory
Growth-curve analysis
Linear filter
Logistic process
Longitudinal studies
Multivariate linear model
Shift estimation
Spatial time series
Structural properties
Weighted least-squares estimator


Those topics are household words at this conference. But they are not the topics of blue 
collar statistical output - and many, perhaps most, blue collar statisticians would have no 
understanding of, or concern with, these topics at all.

Some indication of the output of Statistics Canada is provided by the releases announced
in the Daily of April 29, 1985:

- total number of pigs in Canada (over 10 million)
- the number of tonnes of barley exported (over 150,000 during March 1985)
- the number of square metres of mineral wool shipped (over 6 million)

A further indication of Statistics Canada output is the table of major statistical indicators,
which is updated each week in a publication, Statistical Highlights, sent to ministers and deputy
ministers. These indicators include:

Gross National Product
Housing Starts
Bank Rate
Unemployment Rate
Consumer Price Index Increase
Weekly Earnings

and the measures relating to the economic, business, trade, financial, social and labour sectors
of Canadian society.

Statistics Canada turns out statistical studies on topics such as divorce in Canada, health 
of Canadians, the status of women, current economic indicators, science and technology 
indicators, language characteristics of Canadians and so on.

I want to make it clear that I am not engaged in making an assessment of the relative 
value of these two types of outputs. Both types of work are socially desirable, as indicated 
by the fact each has supporting social constituencies. By definition, each is socially justified. 
But what I am engaged in is trying to analyze the nature of relationship between these
two types of activities, both of which are labelled statistics and carried out by people who 
are called statisticians. 

We could of course simply write it off as a case of homonymy - that is, the same word
being used with two entirely different meanings. Or we could simply continue to ignore
this discrepancy. But neither of those is wise or productive.

You are all familiar with the classic work, The Advanced Theory of Statistics, by Kendall
and Stuart. Volume I involves 396 pages of text plus tables and index. These 396 pages
deal with theoretical constructs of probability statistics and mathematical derivations of 
various formulae. 

The introductory quotation to the book is attributed to O. Henry and reads as follows: 


“Let us sit on this log at the roadside”, says I, “and forget the inhumanity and
ribaldry of the poets. It is in the glorious columns of ascertained facts and legalized
measures that beauty is to be found. In this very log we sit upon, Mrs. Sampson,”
says I, “is statistics more wonderful than any poem. The rings show it was sixty
years old. At the depth of two thousand feet it would become coal in three thousand
years. The deepest coal mine in the world is at Killingworth, near Newcastle.
A box four feet long, three feet wide, and two feet eight inches deep will hold
one ton of coal. If an artery is cut, compress it above the wound. A man’s leg
contains thirty bones. The Tower of London was burned in 1841.”


“Go on, Mr. Pratt”, says Mrs. Sampson. “Them ideas is so original and soothing.
I think statistics are just as lovely as they can be.” (The Handbook of Hymen).


I think the quotation is lovely. And the book is, of course, an excellent example of scholarly 
clarity. But I do wonder what is the connection between the quotation and the text? Do the 
authors see a close connection? Is the quotation - which reflects work like that of the blue
collar statistician - intended to justify, or motivate, the superstructure of probabilistic statistics
which follows? 

Do the authors believe that the constructs and formulae of their text on probabilistic 
statistics serve to guide or validate the work of blue collar statisticians - of statistical agen- 
cies? Or do they believe that the discipline of probability statistics is justified because its 
technology has been used to produce the output of statistical agencies? 

What is real and what is imaginary in this relationship? 

There is something of a conundrum in the relationships between the work of white collar
statistics and blue collar statistics. The apparent outlook seems to be that:


- The information product is valid because it uses approved methodology. 

- The methodology has status because it derives from a formulated theory. 

- But the statistical theory involves constructs and mathematical logic, usually based on
various unverifiable assumptions.


What justifies the assumptions, the constructs and the theory? 

In scientific work, more generally, a theory is justified as good by the usefulness of the 
products produced by technology derived from the theory. 

Indeed, technology is often invented without theory and widely accepted because of its 
utility. Bronze and Damascus steel were developed because of their useful properties, and 
not because of a mathematically consistent theory of metallurgy. 

To assess whether probabilistic statistics is good, we should ask whether it provides a 
technology to produce products that are useful and valuable. 

Instead, statisticians tend to ask the inverse question, namely whether the work of blue 
collar statistics is valid according to the precepts of probability statistics. 
Probability statistics has produced a wide variety of concepts and models and
methodologies. These include areas such as: 
Decision making under uncertainty 
Subjective probability 
Science of inference 
Likelihood inference 
Bayesian estimation 
Time series analysis 
Hypothesis testing 
Tests of significance 
Confidence estimation 
Estimation of sampling errors 
Classification methods 
Regression analysis 
Variance components 
Design of experiments 
Sample survey design 
Unbiased estimators 

and so on. 

Many authors have asserted that the most fundamental concept in applied probabilistic 
statistics is the objective assessment of uncertainty. 

But I must tell you that that notion - however appealing and philosophically profound
- does not comport with the reality of the work and mandate of statistical agencies.

Let me try to establish by example the social importance of the work of blue collar statisti- 
cians. You can make a test of your own. Make a list of what you believe to be the issues 
of interest to Canadian Society. Your list will include matters of employment and unemploy- 
ment, income of the elderly, status of women, economic growth, trade and balance of 
payments, family formation, population distribution, government deficit, etc. 

On examination you will find that, for the large majority of such issues, your percep- 
tions, your knowledge and your understandings depend quite directly on the statistical in- 
formation produced by blue collar statisticians, mainly at Statistics Canada. A similar assess- 
ment would apply in any country in the world. 

To emphasize this point further by a specific example, I would like to summarize some 
of the uses of the consumer price index. 

The consumer price index is updated each month by Statistics Canada based on monthly 
observations of prices of a designated market basket of goods and services. The consumer 
price index is the most commonly used indicator of the rate of inflation. It is often referred 
to as the cost of living index. The consumer price index has a direct or indirect effect on 
nearly all Canadians. It, or the individual components of which it is a weighted average, is used
in the calculations or definitions of income taxes, labour contracts, family allowance payments, 
old age security pensions, rental agreements, insurance coverage, spousal support payments, 
child support payments, payments to children of war veterans, student loan repayments, and 
many other contractual or regulatory arrangements. 

To get back to the matter of objective error estimation - supposedly the central feature 
of probability statistics: Statistics Canada does not produce a statistical measure of the error 
of the consumer price index estimate. We do not publish interval estimates of the consumer price
index. We do not test the hypothesis of no change in the consumer price index from month to
month. We do not produce composite estimates which would supposedly reduce random 
error variance. 
From time to time we are queried or criticized about this, even by people who are not
statisticians or scientists. It seems that, having heard so often about the results of public
opinion polls, members of the public have now begun to expect an error estimate to accompany
published estimates. The phrase “19 times out of 20” is now a part of the vocabulary
of most newspaper readers. Of course, public opinion polls have been going on for a long
time; George Gallup found a record of one taken back in 1824, when a Pennsylvania newspaper
published results of what was called a “straw vote taken without discrimination of parties”.
Modern communications and computer technology have resulted in a proliferation of polls.
Because of their popularity, there has been an increase in public awareness of the fact that
a statistician (or somebody) can conduct a sample survey, make inferences, and put a measure
of uncertainty on estimates.

An audit of Statistics Canada in 1983 by the Auditor General of Canada touched on
the subject of measuring the quality of statistics. The report recommended that Statistics
Canada develop and disclose more measures of quality for its statistics. The agency’s formal
reply was that this “recommendation could not be fully implemented, since ‘measures of
quality’ for many statistics - particularly those of a composite nature - are impossible to
produce”. It would be more realistic, said the Statistics Canada response, to supply “a full
description of available information related to possible quality limitations, including, of
course, quality measures when they are available”.

Statistics Canada would publish more error estimates if we felt we could. It is not that 
we would mind admitting the possibility of error. As Professor R. C. Bose used to say to
his students, “To err is human. Therefore, statisticians are human”.

However, the usual error estimates depend on assumptions which vastly oversimplify the
situation. For example, the Labour Force Survey samples households, not independent individuals who
have equal chances of being selected. Also, by design, the households themselves do not have
equal chances of being sampled; the sampling ratio is approximately 1 in 125 at the national
level, but can be as high as 1 in 24 for provinces with small populations. Can we assume,
then, that all individuals are independent and have an equal probability of being unemployed?
Data are gathered by means of an interview, and either the interviewer or the respondent
may make an inadvertent or even a deliberate mistake. Can we ignore all possible sources
of error except sampling error? Members of a given household are sampled for six consecutive
months, with 1/6 of the households rotating into and out of the overall sample each month.
Thus, in any month, different respondents have responded to the questionnaire different
numbers of times. Can we assume that the six responses are independent over time?
Sometimes, during the six months of sampling, families move away from, or move into,
a particular dwelling being sampled. And, of course, there are the usual problems of nonresponse,
outliers, and errors of data entry, computation, and printing, etc. Concern about
how to handle deviations from the “usual” assumptions of statistical theory is a major continuing
preoccupation of some of Statistics Canada’s blue and white collar statisticians.

So, on the one hand, probability statistics has contributed the appealing and important 
concept of objective estimation of error; and moreover the public has been educated to ac- 
cept the concept and to expect it to be implemented. 

On the other hand, there are many very influential and prominent statistical products pro- 
duced by people called statisticians for which such measures are not provided, and cannot 
be provided at the present time. 

Abstractly, there seem to be several options:


(a) Statistical agencies and probability statistics might agree to stop sharing the label
“statistics” and abandon the notion of connectivity.

(b) Probability statistics could address its efforts to produce technology to deal with the
reality of complex statistical information development.

(c) Statisticians might undertake a public reeducation campaign to cancel the beliefs that
neat and objective measures of statistical uncertainty are possible.

As a practical matter, only option (b) can be considered. And it also holds greater promise
of productive consequences for all statisticians of all varieties.

In an article in Science last year, Ian Hacking, a philosopher of science, commented that
“the quiet statisticians have changed our world - not by discovering new facts or technical
developments but by changing the ways we reason, experiment and form our opinions about
it”.

It is gratifying to read such an assessment of the significance of probabilistic statistics 
as pioneered by Fisher, Neyman, Pearson, Wald and others. 

But, in the vein of my topic tonight, I want to point out that there is another cadre of
“quiet statisticians” - the blue collar statisticians of statistical agencies - who have also contributed
to changing the world; but precisely in the manner inverse to Mr. Hacking’s
assessment.


Blue collar statisticians do discover new facts. 
They do establish new concepts. 
They do invent operational definitions and implement them for public consumption. 


They do pioneer technical developments - in computing, electronic dissemination of in- 
formation, computer graphics, classification systems, national accounting frameworks and 
so on. 

Again I want to remind you my intention is not to make, or to imply, an assessment of 
comparative value. The issue is: what is real and what is imaginary in the relationship of 
blue collar statistics and white collar statistics? 

Most of what blue collar statisticians do does not in reality derive from, or directly relate 
to, the constructs and theories and beliefs associated with probability statistics. And yet - 
the blue collar statisticians are somehow persuaded or coerced into paying lip service to a 
supposedly fundamental connectivity to those concepts. 

At the same time, the white collar statisticians continue with a vague belief that if only 
more of the blue collar statisticians could achieve academic respectability then probability 
statistics would really impact importantly on statistical agencies. 

The synergy which may be latent in the more effective relationship of the blue collar and 
white collar statisticians will not be developed without effort from both groups. 

I don’t have the wisdom to offer any revelatory proposals.

Better channels of communication are obviously needed. In that spirit, Statistics Canada 
has established a program of fellowships and internships. 

Also in that spirit, Statistics Canada has established a network of advisory committees, 
including one on statistical methodology. 

A number of probabilistic statisticians are on contract as consultants to Statistics Canada. 

I expect there is much more opportunity for expanding seminar exchanges and working 
collaborations between Statistics Canada and Universities. 


There is a need for improved intellectual tolerance in both groups. Perhaps the criteria
and standards for publishing need to be modified.

Perhaps the basis for judging the acceptability of research grants by the Natural Sciences
and Engineering Research Council of Canada should be changed.

Perhaps training programs could usefully be modified. Perhaps Statistics Canada should
offer a prize for productive developments related to outstanding areas of need in the operation
of statistical agencies. Maybe we should have a continuing list of the ten most wanted
solutions as an incentive, and communication mode, to probability statistics researchers.

Maybe the Statistical Society of Canada should establish a tradition that every year the
after-dinner speaker at the annual meeting should talk about “the relationship of statisticians
and statisticians”.
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ABSTRACT 


The use of a multivariate clustering algorithm to perform stratification for the Labour Force Survey 
is described. The algorithm developed by Friedman and Rubin (1967) is modified to allow the forma- 
tion of geographically contiguous strata and to delineate heterogeneous but compact primary sampling 
units (PSUs) within these strata. Studies dealing with stratification variables, stratification robustness 
over time, and type of stratification are described. 


KEY WORDS: Multivariate clustering algorithm; Geographic stratification; Continuous survey. 
1. INTRODUCTION 


The Canadian Labour Force Survey (LFS) is redesigned after every decennial census of 
population and housing. The redesign which occurred following the 1981 Census included an 
intensive program of research on various aspects of the sample design (Singh, Drew and 
Choudhry 1984). This report describes the portion of the research program dealing with 
stratification methods. 

Because the LFS is used not only to provide information on labour force characteristics 
but also as a general design for various other household surveys, one of the principal 
objectives of the redesign was to increase the flexibility of the LFS for general applications. 
Stratification was considered a means of improving efficiency both for general applications 
and for variables of particular interest to the LFS, through the application of more rigorous 
procedures than those used in the old design. 

It was therefore decided to consider the use of multivariate clustering algorithms and to 
compare them with the methods used in the old design. A non-hierarchical algorithm developed 
by Friedman and Rubin (1967) was selected on the basis of the results of evaluations of various 
algorithms by Judkins and Singh (1981) as part of the redesign of the Current Population 
Survey of the U.S. Bureau of the Census. A description of the basic algorithm and of the 
extensions which we have developed appears in section 2. 

Sections 3 and 4 describe the evaluation studies and the stratification eventually adopted 
in the two main types of area distinguished by the LFS sample design, namely 
non-self-representing units (NSRUs) and self-representing units (SRUs). Section 3 also 
describes how the algorithm was adapted to delineate the primary sampling units (PSUs) 
within the NSR strata. 

Section 5 concludes with a number of observations on the possibility of adapting the new 
system to other applications. 


2. STRATIFICATION ALGORITHM 


The basic algorithm used for stratification is a non-hierarchical multivariate algorithm 
developed by Friedman and Rubin (1967). This choice is based on the results of studies 
performed by Judkins and Singh (1981) and Kostanich, Judkins, Singh and Schantz (1981), 
who assessed a number of stratification algorithms for the Current Population Survey of 
the U.S. Bureau of the Census. 

The latter modified the objective function of the algorithm for sampling with probability 
proportional to size (PPS), and we have added the capacity to formulate compact, contigu- 
ous strata. A more complete description of the following appears in Foy (1984). 


2.1 The Objective Function of the Algorithm 


The algorithm is designed to partition the stratification units (census enumeration areas) into 
strata which are as homogeneous as possible with respect to a number of variables of 
interest, that is, by minimizing the sums of squares within each stratum. 

The expressions for the sums of squares in the case of sampling with PPS are shown be- 
low after introduction of the following notation: 


$L$ = number of strata to form, 
$N$ = total number of units (enumeration areas), 
$N_k$ = number of units in group (stratum) $k$ ($N_1 + N_2 + \dots + N_L = N$), 
$T_{kj}$ = size measure of unit $j$ in group $k$, 
$T_{k\cdot}$ = size measure of group $k$, 
$T_{\cdot\cdot}$ = total size, 
${}_iX_{kj}$ = observed value of variable $i$ for unit $j$ in group $k$, 
${}_iX_{k\cdot}$ = total observed values of variable $i$ in group $k$, 
${}_iX_{\cdot\cdot}$ = total observed values of variable $i$, 
$W_i$ = weighting factor of variable $i$ (see section 2.4 for further details), 
$p$ = number of variables of interest. 


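Thus, the total sum of squares with PPS of variable $i$ is given by 

$$SCT_i = \sum_{k=1}^{L} \sum_{j=1}^{N_k} \frac{T_{kj}}{T_{\cdot\cdot}} \left( \frac{T_{\cdot\cdot}}{T_{kj}}\,{}_iX_{kj} - {}_iX_{\cdot\cdot} \right)^2 .$$

This is also the variance expression of the estimate of ${}_iX_{\cdot\cdot}$ when a unit is 
selected with PPS. The total sum of squares weighted for all variables is thus 

$$SCT = \sum_{i=1}^{p} W_i \, SCT_i .$$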


The within-group and between-group sums of squares are obtained respectively by the 
following expressions: 

$$SCW_i = \sum_{k=1}^{L} \frac{T_{k\cdot}}{T_{\cdot\cdot}} \sum_{j=1}^{N_k} \frac{T_{kj}}{T_{k\cdot}} \left( \frac{T_{\cdot\cdot}}{T_{kj}}\,{}_iX_{kj} - \frac{T_{\cdot\cdot}}{T_{k\cdot}}\,{}_iX_{k\cdot} \right)^2$$

and 

$$SCB_i = \sum_{k=1}^{L} \frac{T_{k\cdot}}{T_{\cdot\cdot}} \left( \frac{T_{\cdot\cdot}}{T_{k\cdot}}\,{}_iX_{k\cdot} - {}_iX_{\cdot\cdot} \right)^2 .$$

Their sums of squares weighted for all variables are given respectively by 

$$SCW = \sum_{i=1}^{p} W_i \, SCW_i$$

and 

$$SCB = \sum_{i=1}^{p} W_i \, SCB_i .$$

The within-group sum of squares of variable $i$, $SCW_i$, is also the variance expression of the 
estimate of ${}_iX_{\cdot\cdot}$ when a stratum, and subsequently a unit of this stratum, is 
selected with PPS. 


Once again, we have the following result: 

$$SCT_i = SCW_i + SCB_i \qquad (i = 1, 2, \ldots, p)$$

and 

$$SCT = SCW + SCB .$$


The objective function of the stratification program is SCW, the within-group sum of squares 
weighted for all variables. We define the stratification index for variable $i$ as: 

$$I_i = 100 \times \frac{SCB_i}{SCT_i}, \qquad i = 1, 2, \ldots, p .$$

A high index value indicates a good clustering. 
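
The sums of squares and the index defined in this section can be illustrated numerically. The following is a minimal Python sketch, not the original stratification program. It assumes that the PPS total sum of squares is the variance of a single PPS draw and that it splits into within-group and between-group parts in the usual way; the function and argument names are hypothetical.

```python
from collections import defaultdict


def pps_sums_of_squares(x, t, group):
    """Return (SCT, SCW, SCB) for one variable under PPS sampling.

    x[u]     : observed value of the variable for unit u
    t[u]     : size measure of unit u (assumed strictly positive)
    group[u] : label of the stratum to which unit u is assigned
    """
    total_size = sum(t)
    total_x = sum(x)

    group_size = defaultdict(float)
    group_x = defaultdict(float)
    for xu, tu, g in zip(x, t, group):
        group_size[g] += tu
        group_x[g] += xu

    # Each unit's PPS-expanded value is total_size * x / t.
    sct = sum(tu / total_size * (total_size * xu / tu - total_x) ** 2
              for xu, tu in zip(x, t))
    scw = sum(tu / total_size
              * (total_size * xu / tu - total_size * group_x[g] / group_size[g]) ** 2
              for xu, tu, g in zip(x, t, group))
    scb = sum(group_size[g] / total_size
              * (total_size * group_x[g] / group_size[g] - total_x) ** 2
              for g in group_size)
    return sct, scw, scb


def stratification_index(x, t, group):
    """Between-group share of the total sum of squares, as a percentage."""
    sct, _, scb = pps_sums_of_squares(x, t, group)
    return 100.0 * scb / sct if sct > 0 else 0.0
```

With four units of equal size, values 1, 2, 3, 4 and strata {1, 2} and {3, 4}, the sketch gives SCT = 20, SCW = 4 and SCB = 16, so the index is 80.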


2.2 Identification of the Best Clustering 


One way of identifying the best clustering would be to generate all the possible partitions 
of N units into L groups and then simply select the one which minimizes the objective func- 
tion. This approach is rarely feasible because the number of possible partitions may be un- 
manageably large. 

Friedman and Rubin (1967) suggest the following algorithm. Begin with any partition of 
the N units into L groups. Consider moving a single unit to a group other than the one it 
is in. Move the unit to the group which offers the greatest reduction in the objective 
function. If no move will produce a reduction, leave the unit where it is. Using the partition 
thus created, process the second unit in the same way, then the third, etc. The application 
of this procedure to each unit in turn constitutes an iteration, which the authors describe 
as a hill-climbing pass. After several hill-climbing passes, the algorithm reaches a point at 
which no move of a single unit will produce a reduction in the objective function. This point 
is described as a local minimum of the objective function because it is dependent on the 
starting partition. Another starting partition might have achieved an even lower value of the 
objective function. To move beyond the local minimum, Friedman and Rubin describe two 
procedures, the forcing pass and the reassignment pass. Their own objective function is one 
that is maximized rather than minimized. By applying their algorithm to data described in 
their article, they obtain the highest known value of that objective function in 10 out of 
14 runs from different starting partitions. With some less well-structured data, the highest 
value was reached in 3 out of 11 runs, although it is impossible to be certain that this is 
the optimal solution. In their opinion, the forcing pass and reassignment pass methods are 
useful only on occasion. They have more confidence in the results obtained through the use 
of a number of starting partitions. This view is supported by Judkins and Singh (1981). We 
therefore decided to use the technique involving a number of starting partitions. 

Because the algorithm moves only one unit at a time, calculation of the objective func- 
tion is simplified. Following the initial calculation of the objective function, we merely 
recalculate the contribution to the objective function of the two groups involved in the move 
of the unit in question. 
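
The hill-climbing pass and the two-group update can be sketched as follows. This is an illustrative Python sketch only: as a stand-in for SCW it uses the ordinary within-group sum of squares of a single variable, and the function names and the rule that a move may not empty a group are assumptions, not part of the original program.

```python
def group_cost(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def members(x, assign, g):
    """Values of the units currently assigned to group g."""
    return [xv for xv, a in zip(x, assign) if a == g]


def hill_climbing_pass(x, assign, n_groups):
    """Process every unit once; return True if any unit was moved.

    `assign` is updated in place, so each unit is evaluated against the
    partition left by the units processed before it.
    """
    moved = False
    for u, xu in enumerate(x):
        src = assign[u]
        src_values = members(x, assign, src)
        if len(src_values) == 1:
            continue  # assumption: a move may not empty a group
        src_without = list(src_values)
        src_without.remove(xu)
        # Only the source and destination groups change, so only their
        # contributions to the objective need to be recomputed.
        saving = group_cost(src_values) - group_cost(src_without)
        best_change, best_group = 0.0, None
        for g in range(n_groups):
            if g == src:
                continue
            dst_values = members(x, assign, g)
            increase = group_cost(dst_values + [xu]) - group_cost(dst_values)
            change = increase - saving
            if change < best_change - 1e-12:
                best_change, best_group = change, g
        if best_group is not None:
            assign[u] = best_group
            moved = True
    return moved


def hill_climb(x, start, n_groups, max_passes=100):
    """Repeat hill-climbing passes until a pass moves no unit."""
    assign = list(start)
    for _ in range(max_passes):
        if not hill_climbing_pass(x, assign, n_groups):
            break
    return assign
```

Because the result is only a local minimum, one would call `hill_climb` from several starting partitions and keep the partition with the lowest objective value.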


2.3 Contiguity 


Previous LFS sample designs have used strata composed of contiguous geographic units; 
that is, each unit in a given stratum had to be touching at least one other unit in the same 
stratum. One of the main reasons was the assumption that such strata would retain the effi- 
ciency of the sample design for a longer period of time than if they were formed of discon- 
tiguous units. 

In order to assess this assumption and to adopt the best possible stratification, we 
considered two means of taking geography into account in the stratification. The first method 
is described by Dahmström and Hagnell (1978), and consists of the use of centroids as variables 
of interest. This method uses two geographic variables (centroids), which are transformations 
of longitude and latitude. It yields compact strata, that is, strata in which the distance 
between units is made minimal by minimizing the usual within-group sum of squares of the 
centroids. However, the minimization is tempered by minimization of the other variables 
of interest. Moreover, there is no assurance that these strata will be composed of contiguous 
units. 

The other method, which we describe as the contiguity vectors approach, is new. It 
guarantees contiguous, but not necessarily compact, strata. Studies described in section 3 
dealt with the use of each of these methods in isolation or in combination. 


2.3.1 Contiguity Vectors 


To ensure the formation of contiguous strata, we proceeded as follows. Optimization is 
performed as described in the preceding section but beginning, in this case, with a starting 
partition which is contiguous, and permitting the movement of unit j from stratum A to 
stratum B only if, in addition to reducing the sums of squares, the following conditions are 
met: 


i) unit j is contiguous to a unit in stratum B 
ii) the movement of unit j to stratum B will not disrupt the contiguity of stratum A. 


In order to verify these two conditions, it is essential that we know the links of contiguity 
between the units. Consequently, each unit must be assigned a contiguity vector containing 
a list of the units contiguous to it. 

The first condition is easy to verify. In order to ensure that unit j is contiguous to a unit 
in stratum B, we must simply find one unit in its contiguity vector which is in stratum B. 

The second condition is more difficult to verify. The principle is that a stratum is said 
to be contiguous if each pair of units in that stratum can be connected by a contiguous chain 
of units in that stratum. Suppose we want to move unit j from stratum A to stratum B. We 
therefore have to find, for each pair of units in the contiguity vector of unit j within stratum 
A, another link from among the units of stratum A that does not pass through unit j. At this 
stage, the problem becomes like finding a path through a maze. 
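
The two conditions can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' routine: instead of the pairwise search described in the text, it runs a breadth-first search over the whole of stratum A with unit j removed, which leads to the same yes/no answer. The dictionaries `stratum_of` and `neighbours` (the contiguity vectors) and all function names are hypothetical.

```python
from collections import deque


def touches_stratum(unit, target, stratum_of, neighbours):
    """Condition (i): some unit in the contiguity vector of `unit` is in `target`."""
    return any(stratum_of[v] == target for v in neighbours[unit])


def stays_contiguous(unit, stratum_of, neighbours):
    """Condition (ii): the stratum of `unit` remains connected without it."""
    home = stratum_of[unit]
    remaining = {v for v, s in stratum_of.items() if s == home and v != unit}
    if not remaining:
        return False  # assumption: a move may not empty a stratum
    start = next(iter(remaining))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbours[v]:
            if w in remaining and w not in seen:
                seen.add(w)
                queue.append(w)
    # Contiguous only if every remaining unit was reached from the start unit.
    return seen == remaining


def move_allowed(unit, target, stratum_of, neighbours):
    """True if moving `unit` to `target` satisfies both contiguity conditions."""
    return (touches_stratum(unit, target, stratum_of, neighbours)
            and stays_contiguous(unit, stratum_of, neighbours))
```

In the full algorithm a move would also have to reduce the sums of squares; that test is left out here to keep the contiguity logic separate.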

An algorithm has also been designed to create random starting partitions whose strata 
are contiguous. 


2.4 Weighting of Variables 


The weighting factors are of particular importance, since they determine the contribution 
of each variable to the cluster analysis. 

It is usually preferable to standardize the variables by making the weighting factors in- 
versely proportional to the total sum of squares of each variable. This standardization makes 
it possible to obtain a comparable contribution by each variable to the cluster analysis. 

If, after standardization, we want to assign one or more variables greater importance in 
relation to the other variables in the optimization, we can do so by specifying a weight greater 
than 1 (normal). For example, a variable with a weight of 2 would have double importance. 
As described in section 3.2, we tested a number of combinations of weights for the geographic 
and non-geographic variables in an effort to obtain compact strata without unduly affecting 
the minimization of the other variables. 
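
As a small illustration of this weighting scheme, the Python sketch below computes weighting factors that are inversely proportional to each variable's total sum of squares and then scales them by a relative importance (1 = normal). The function name and its arguments are hypothetical.

```python
def weighting_factors(total_ss, importance=None):
    """Weights inversely proportional to each variable's total sum of squares.

    total_ss[i]   : total sum of squares of variable i (must be positive)
    importance[i] : relative importance after standardization (1 = normal)
    """
    if importance is None:
        importance = [1.0] * len(total_ss)
    if len(importance) != len(total_ss):
        raise ValueError("one importance value is needed per variable")
    if any(s <= 0 for s in total_ss):
        raise ValueError("every variable needs a positive total sum of squares")
    return [m / s for m, s in zip(importance, total_ss)]
```

After standardization every variable contributes a weighted total sum of squares of 1, so a variable given an importance of 2 contributes exactly twice as much as a normal one.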


3. STRATIFICATION IN NON-SELF-REPRESENTING UNITS 
3.1 Old Design (Platek and Singh 1976) 


For the purposes of the LFS, each of Canada’s ten provinces is divided into a number 
of economic regions (ERs), consisting of areas having similar economic structures. The 
boundaries of the ERs are determined in consultation with the provinces. These ERs are used 
as primary strata. The next stage in stratification is the partition of each ER into 
self-representing units (SRUs) and non-self-representing units (NSRUs). The self-representing 
units are cities in which the expected sample is large enough to represent at least one 
interviewer assignment; the NSR part makes up the rest of the ER. Different sample designs are 
used in the SRUs and the NSRUs, because the population in the NSRUs is much more widely 
dispersed, necessitating a larger number of sampling stages. For the same reasons, we are 
retaining the concept of the SRUs and the NSRUs in the redesign. 

In the old design, the NSR portion of each ER was stratified into a maximum of 5 
contiguous strata with a population of between 36,000 and 75,000, based on the main 
characteristics of the 1971 census population, as described below and as discussed at greater 
length by Platek and Singh (1976). 

The labour force was divided into 7 categories by industry. In each ER, the three largest 
industries were selected on the basis of specific criteria. The unit chosen for stratification 
was the combined municipality, which is the geographic region enclosed within a rural 
municipality and, as such, often contains within its boundaries urban municipalities which 
are geographically smaller. By comparing, for each of these units, the proportions of the 
labour force working in each of the three categories with the corresponding proportions at 
the ER level, we identified units showing a certain similarity, which were grouped into 
strata. This comparison was done visually, using graphs. Adjustments were occasionally 
necessary to satisfy the size and contiguity constraints. 

Within each stratum, 12 to 15 PSUs were formed, all of them representative of the stratum 
in terms of the stratification variables, and of the ratio of rural to urban population. The 
rural parts of the PSUs were formed of contiguous EAs, and the urban parts were chosen 
to be as near to the rural part as possible. The sizes of the strata and the PSUs were 
determined so that, with two PSUs per stratum, the expected sample was equivalent to one 
interviewer’s assignment size. On the basis of these criteria, and depending on the province, 
the population of the PSUs varied between 3,000 and 5,000 persons. Within the PSUs, sampling 
occurred in 2 or 3 stages. 


3.2 Studies on Stratification during Redesign 


Our studies were designed to produce conclusions which would assist in certain decisions 
relating to the following aspects of stratification: variables to be used, types of strata (whol- 
ly rural, wholly urban, or mixed), and the importance to be assigned to contiguity. Given 
the very limited time available for studies prior to the formation of the new strata and PSUs, 
and the general expectation that contiguous strata would be preferable over time to discon- 
tiguous strata, the first two aspects were given priority. 

Some experimenting was required to find the best means of achieving contiguity, either 
by contiguity vectors, centroids or a combination of the two. However, following the redesign, 
a more detailed study was undertaken on the relative desirability of contiguous versus discon- 
tiguous strata. 


3.2.1 Study on Variables and Type of Stratification 


One constraint on the stratification method used in the old sample design was the limited 
number of stratification variables which could be taken into consideration (3 per ER). 

With the new algorithm, this constraint is eliminated. In addition to the seven industry 
variables, we wished to determine the effect caused by the use of variables relating to the 
survey topic, such as employment, unemployment and income, and by such characteristics 
as education, housing and population. The latter characteristics have proven extremely effi- 
cient in similar studies performed by the U.S. Bureau of the Census for the Current Popula- 
tion Survey. 

Table 1 describes the various options studied with respect to the choice of variables. 

As regards the type of stratification, it was decided to study the effect of having separate 
strata for the rural and urban parts of the ERs, as an alternative to the mixed method of 
the old design. 

The constraints on the sample design requiring PSUs to be approximately equivalent in 
population size, and the ratio between rural and urban population to remain generally the 
same for each PSU, frequently resulted in a lack of contiguity between the rural and urban 
parts of the PSUs. This led to an erosion in the presumed correspondence between the PSU 
and the interviewer assignment. Stratification into separate rural and urban parts, which could 
be substratified on an optimal basis, was felt to be a possible solution to this problem. 


The study dealt with 11 economic regions from across Canada. The strata were defined 
on the basis of 1971 Census data, and assessed on the basis of 1981 Census data. In 
performing the stratification, we used the 1971 Census enumeration areas as our stratification 
unit, except in Quebec and Ontario. For these two provinces, we selected census subdivisions, 
since the large number of EAs in certain ERs (up to 400) would have made execution of the 
computer programs extremely costly. 


We used a conversion file between the geographic units of the two censuses to perform 
the evaluation based on the 1981 Census. The indices based on the 1981 data were considered 
more appropriate for evaluation purposes, since the stratification data will in fact be, on 
average, 7 or 8 years old over the life of the sample design. Table 2 shows the indices 
based on both 1971 and 1981 census data. 
Table 1 
Stratification Options by Variables 

                            Stratification option 
Variables                1      2      3      4      5 

Industries (7)^a         x      x      x      x      x 
Income                          x      x      x      x 
Employed                        x      x             x 
Unemployed                      x      x^b           x 
Demography (2)^c                              x      x 
Housing (4)^d                                 x      x 
Education (1)^e                               x      x 

^a number of persons employed in agriculture, forestry and fisheries, mines, manufacturing, 
   construction, transportation, services. 
^b double weighting on unemployment. 
^c population 15-24, population 55 and over. 
^d 1-person households, 2-person households, owned dwellings, total gross rent. 
^e secondary education. 


For this study, we chose to form contiguous, compact strata, using contiguity vectors and 
centroids with an average weight of three (see subsection 3.2.2). The number of strata per 
ER was the same for all options. 

The following conclusions were drawn from the results of the study, which are summarized 
in table 2. 


Type of Stratification: Rural/urban stratification was far superior to total stratifica- 
tion in the case of the agriculture variable, which is not surprising. The same phenome- 
non was evident for the manufacturing variable, although it was less spectacular. For 
the income variable, rural/urban stratification was also initially more satisfactory, but 
it was not particularly robust (that is, the index deteriorated over time). Rural/urban 
stratification was preferable for the unemployed variable, while there was little differ- 
ence for employed. 

Stratification Variables: Option 4, in combination with rural/urban stratification, was 
clearly superior for the unemployed variable. As regards the other variables, option 
5 was slightly more satisfactory than the rest for employed and income. 


3.2.2 Study on Contiguity 




As previously mentioned, it was decided to retain the concept of contiguous strata for the 
LFS. Such strata should be better for the production of small area estimates, because of 
their better geographic representation. In addition, it was felt that contiguous strata would 
maintain the efficiency of the sample design for a long period of time. 
Table 2 
Stratification Indices by Option 

                                                               Total                 Rural/Urban 
Stratification variables                                  1971         1981         1971         1981 

Unemployed 
  7 industries                                             5.4          0.1          9.9          3.8 
  7 industries + income + employed + unemployed    [illegible]  [illegible]         10.2          3.4 
  7 industries + income + employed + unemployed × 2        7.4  [illegible]         10.2          5.3 
  17 variables                                             6.3          6.4  [illegible]          4.7 
  15 variables (excluding employed + unemployed)           3.6          0.1          9.8          9.0 

Employed 
  7 industries                                             2.9          0.5          8.9          4.8 
  7 industries + income + employed + unemployed            8.8  [illegible]          8.6  [illegible] 
  7 industries + income + employed + unemployed × 2        9.1          2.8  [illegible]  [illegible] 
  17 variables                                            14.1          7.8         17.2          6.4 
  15 variables (excluding employed + unemployed)           6.3          1.6         11.4  [illegible] 

Income 
  7 industries                                             7.4  [illegible]         18.9          9.5 
  7 industries + income + employed + unemployed           11.2          6.8  [illegible]          5.9 
  7 industries + income + employed + unemployed × 2       10.3          6.8         28.3          9.5 
  17 variables                                            10.5          9.4         24.4         11.9 
  15 variables (excluding employed + unemployed)          21.0  [illegible]         28.9          4.5 

Agriculture 
  7 industries                                             7.4          9.7         37.0         26.0 
  7 industries + income + employed + unemployed            7.6          7.8         40.0  [illegible] 
  7 industries + income + employed + unemployed × 2        8.6          7.9         43.2         31.0 
  17 variables                                             6.1  [illegible]         40.3         31.8 
  15 variables (excluding employed + unemployed)   [illegible]          0.4  [illegible]         29.0 

Manufacturing 
  7 industries                                            14.7          8.5         16.9         13.2 
  7 industries + income + employed + unemployed           10.9          6.6         16.5  [illegible] 
  7 industries + income + employed + unemployed × 2 [illegible]         4.3         14.8         16.1 
  17 variables                                            12.5  [illegible]         13.3         10.7 
  15 variables (excluding employed + unemployed)   [illegible]          1.4         14.1         16.4 

The next question was how to use the centroids or contiguity vectors, or a combination 
of the two, to obtain compact, contiguous strata without allowing the geographic constraints 
to affect minimization of the other variables unduly. 

The study was performed with the same 11 economic regions. As anticipated, the use of 
contiguity vectors alone resulted in strata which were contiguous, but often irregular in shape. 
At the same time, the use of centroids alone, even with high weights, failed to provide any 
guarantee of absolute contiguity. 

By varying the weight of the centroids relative to the other variables, we found that a 
combination of a centroid weight of 3 and contiguity vectors offered a good compromise 
between compactness and non-geographic optimization. 


3.3 Design Stratification 


In view of these results and the superior results shown by a sample design using rural/urban 
stratification in a study on cost variance optimization (Choudhry, Lee, Drew 1985), we decided 
to use separate stratification for all economic regions except those in which either the rural 
or the urban population was too small to form at least one stratum. It was determined that 
each stratum should provide a sample of at least 90 dwellings, corresponding to the selection 
of two PSUs with a minimum take of 45 dwellings each. In cases where this requirement 
could not be met, we decided to proceed with overall stratification and thus to form mixed 
strata. This criterion led to the adoption of separate strata in over 2/3 of the ERs. 

As regards the stratification variables, we compromised on a stratification based on the 
15 variables of option 4 plus employed. Employed was added because its inclusion in option 
5, as compared to option 4, improved the performance of the employed and income 
characteristics. For the same reason, unemployed was excluded as a stratification variable. 

For the geographic constraints, it was decided to use the contiguity vectors in combina- 
tion with a uniform centroid weighting of 3 in all economic regions. 

A decision was also required as to the number of strata per ER. In practice, in most 
cases, there was no choice. According to the sample design, each PSU corresponds to 
one interviewer assignment, and we wanted to select at least two PSUs per stratum in order 
to produce unbiased variance estimates. Given these constraints, in almost 2/3 of the cases, 
only one stratum was formed with 2 or 3 selections, in the urban or rural parts or a 
combination of the two. In the other cases, stratification was performed in such a way as to 
permit the selection, again, of 2 or 3 PSUs per stratum. This decision was based on another 
study showing slight reductions in variance with this approach, as compared to the old sample 
design in which 4 to 6 PSUs were selected from each stratum (Choudhry, Lee, Drew 1985). 


3.4 Study on Robustness of Contiguous and Discontiguous Strata 


Robust strata are strata that maintain the efficiency of the sample design over time. 
Following redesign, a study was performed to determine whether contiguous strata would be more 
robust over time, as had been hypothesized. 

The study dealt with three economic regions in Ontario, ERs 520, 540 and 580 (1981 
numbering). For each of these regions, the results of the new stratification (selected for the 
redesign of the LFS), which consists of contiguous strata, were compared with a stratification 
without contiguity constraints. The strata were defined on the basis of the 1981 data, 
and evaluated on the basis of the 1971 data. For the contiguous strata, we used contiguity 
vectors with centroids, while for the discontiguous strata, we tested two options using 
centroid weights of 0 and 3 respectively. The stratification variables used were the same 16 
variables described above (modified option 4). 

The results are shown in table 3. In general, the index calculated on the data used for 
stratification (1981 column) is higher for the two options in which contiguity is not 
required, as might be expected. However, these two options also give higher indices over 
time (1971 column). 

Do we really need contiguous strata? Before answering this question, we would have to 
perform a more in-depth study involving ERs from a number of provinces. Evaluation of 
stratification robustness would however pose certain problems. It is easy to evaluate robustness 
in Ontario, since stratification there is performed at the census subdivision level, which has 
changed very little since 1971. When stratification is performed at the level of the enumera- 
tion areas, which are very changeable, it is extremely difficult to obtain precise figures on 
robustness when the strata are neither compact nor contiguous. 

However, should it prove that stratification without contiguity is more satisfactory, this 
could compensate for the possible problems involved in production of small area estimates. 
It could also open new horizons: once contiguity constraints are eliminated, why could we 
not begin by forming compact, but not necessarily contiguous, PSUs, and then grouping 
them into strata? This question also could only be answered by further and more detailed 
studies. 


3.5 Formation of PSUs 


The clustering algorithm was modified to permit PSU delineation in rural and mixed strata. 
In the rural strata in particular, the formation of the PSUs is conceptually very similar to 
stratification. The only difference relates to the fact that in stratification, we attempt to 
minimize the sums of squares of the geographic and non-geographic variables within each 
stratum, while in PSU formation, we want to minimize the sums of squares of the geographic 
variables (to obtain compact PSUs in order to reduce costs) and to maximize those of the 
non-geographic variables. The latter criterion enables us to obtain PSUs which are as 
heterogeneous as possible in terms of characteristics, so that they are all properly 
representative of the stratum during sampling. 

There is, however, a conflict between the desired compactness of the PSUs and their 
heterogeneity, because of the tendency of adjacent units to possess similar characteristics. 
Because of low computer costs, we performed 3 delineations per stratum with centroid weights 
of 10, 15 and 20, relative to the other variables. The results of each delineation were then 
plotted on a graph whose axes are the centroids (see Figure 1). We then selected the best 
of the 3 delineations on the basis of the quality of variable optimization, as reflected by the 
stratification indices, and through reference to the graphs. A compactness index was also 
taken into consideration. In most cases, as it worked out, a centroid weight of 10 or 15 was 
selected. 


Table 3 
Stratification Indices by Geographic Constraints 

                                          Geographic Constraints 
                           Contiguity and           Centroids 
Economic    No. of           centroids            (weight of 3)              None 
Region      Strata        1981       1971       1981       1971        1981       1971 

520            2          22.2       28.5       30.2       30.1        34.5       27.0 
540            3          21.8       14.1       24.9       17.8  [illegible]      26.8 
580       [illegible]     22.8       18.9       41.4       33.7        43.7       38.5 

Formation of the PSUs in the mixed strata led to an additional constraint. We wanted 
the proportion of urban population to be approximately the same in each PSU. Since we 
also wanted the PSUs to have approximately equal total populations, it was therefore necessary 
in some cases to split the large urban centres among a number of PSUs. The following 
solution was adopted: 


1. The average number of parts of urban centres which a PSU will receive (N) is deter- 
mined. This number depends on the proportion of the urban population in the stratum 
and on the number of urban units. In practice, it was set at 1 or 2. Certain strata without 
sufficient population or a sufficient number of urban units were reclassified as entire- 
ly rural strata. 


2. The number of parts into which each urban centre will be divided is determined. The 
total number of parts must equal N times the number of PSUs and each urban centre 
is divided into a number of parts proportional to its population. 


Figure 1. Example of PSU Delimitation (horizontal axis: CENT1; vertical axis: CENT2). Each 
stratification unit is represented by a letter identifying the PSU to which it belongs. The 
PSUs are circled for clearer differentiation. 

3. The optimal stratification program is applied, considering each part of an urban
centre as a distinct stratification unit and adding the urban population variable to the other
stratification variables. The weight assigned to this variable is adjusted to obtain the
most evenly balanced rural/urban distribution possible within each PSU, without unduly
disrupting compactness and overall optimization. This can be done only by trial
and error. In practice, we found that a weight of 10 or 15 on urban population, relative
to the other variables, produced satisfactory results.
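
Step 2 of the procedure above can be illustrated with a small sketch. The text says only that the number of parts per urban centre is proportional to its population and that the parts must total N times the number of PSUs; the largest-remainder rounding rule, the function name `apportion_parts` and the populations used here are assumptions made for illustration, not the procedure actually programmed for the redesign.

```python
def apportion_parts(populations, total_parts):
    """Split total_parts among urban centres roughly in proportion to population.

    Uses largest-remainder rounding (an assumed rule) so that the parts
    always sum to total_parts.
    """
    total_pop = sum(populations.values())
    quotas = {c: total_parts * p / total_pop for c, p in populations.items()}
    parts = {c: int(q) for c, q in quotas.items()}      # integer part of each quota
    leftover = total_parts - sum(parts.values())         # parts still to hand out
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - parts[c], reverse=True)
    for centre in by_remainder[:leftover]:
        parts[centre] += 1
    return parts


# Hypothetical stratum: 4 PSUs, N = 2 parts of urban centres per PSU.
n_psus = 4
parts_per_psu = 2
centres = {"A": 60000, "B": 30000, "C": 10000}
parts = apportion_parts(centres, parts_per_psu * n_psus)
print(parts)
```

Under this rounding rule a very small centre could receive no part at all, which is one reason such a rule would need checking against the actual stratum before use.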

In the urban strata, the PSUs were composed of urban centres. In some cases, small
centres, relatively close together, were combined, without considering characteristic optimality.

Table 4 gives the average delineation indices for the PSUs in rural, mixed and urban strata. 

For the non-geographic variables, the lowest index represents the best delineation, while the 
opposite is true for the centroids. The results are clearly better, in terms of characteristic 
optimality, for the rural and mixed strata, in which the clustering algorithm was used. The 
high indices of the centroids show that the PSUs are relatively compact. 


Table 4
Average PSU Delineation Indices

                                        Type of stratum
Variables                     Rural          Mixed          Urban

Agriculture                     8.1            8.3            9.0
Forestry                       21.8           24.5           35.9
Mines                          20.6           36.0           57.0
Manufacturing           [illegible]    [illegible]           53.3
Construction                    9.0           11.4    [illegible]
Transportation                  9.9           12.8    [illegible]
Services                        9.4           12.8           29.1
Employed                [illegible]           10.2           23.6
Unemployed (a)                 13.6           14.2           18.6
Income                          8.9    [illegible]    [illegible]
Population 15-24                9.4           13.4           29.8
Population 55+                  7.4           13.9           34.5
1-person households             7.4           13.0    [illegible]
2-person households             7.9           11.9           28.1
Owned dwellings                 6.8    [illegible]           29.4
Total gross rent        [illegible]    [illegible]           14.4
Secondary education             9.1           10.5           17.4
Total population (a)            3.2            4.0           10.5
Dwellings (a)                   5.9            8.9           18.6
Centroid 1                     91.6    [illegible]           99.2
Centroid 2                     90.5           91.7           97.2

(a) Not used as a variable in optimization.
4. STRATIFICATION IN SELF-REPRESENTING UNITS


4.1 Old Design 


The self-representing units of the old sample design corresponded to those cities large 
enough to yield an expected take equivalent to one interviewer assignment. The lower limit 
for SRUs varied from 10,000 persons in the Atlantic provinces to 29,000 in Quebec and 
Ontario. 

The large SRUs were geographically stratified by grouping 3 to 5 contiguous census tracts
(CTs), without any attempt at optimality. CTs are geostatistical units with populations
between 3,000 and 5,000; because of their stability from one census to the next, they are
practical operational units. It was felt that these strata would be efficient in estimating
characteristics, and that their small size (between 10,000 and 15,000 persons) would permit
sample updating in areas experiencing rapid growth, without disrupting the rest of the sample.

In addition to the area frame, an open-ended frame was set up for apartment buildings
in the large cities.


4.2 Study on Stratification


Three large SRUs were considered in this study, namely Quebec City, Ottawa and Toronto.
The stratification unit selected was the census tract. Because of operational constraints
imposed by the stratification program, it was necessary to break Toronto up into six parts,
corresponding generally to the city’s major natural divisions. Stratification was carried out
separately in each of these parts. The same 16 stratification variables finally selected in the
NSR part were used.

Two main options were evaluated: 


Option 1: Two-level stratification:
          - contiguous, compact primary strata, with a centroid weighting of 3 and an
            expected take of approximately 150 dwellings;
          - secondary strata, 4 or 5 per primary stratum, formulated without
            geographical constraints.

Option 2: Compact stratification formulated with the use of centroids (weight of 3) and
          without contiguity vectors, comparable in size to the secondary strata of
          option 1.


Table 5 shows the results of the comparison between the old stratification and the two
options studied. As in the NSR part, the strata were defined on the basis of 1971 Census
data, and then evaluated on the basis of 1981 data.


We see that the two options studied consistently show better indices than the old
stratification, with the possible exception of the first three variables, which, in any case, are
of limited importance in cities. The old stratification nevertheless performed quite well,
considering that it was carried out without any concern for optimality.

We also note that all three methods provide generally robust stratification over time, as 
reflected by the comparison between the indices for 1981 and 1971. Major exceptions to this 
rule, unfortunately, appear to be the employed and unemployed characteristics. 


4.3 New Design 


Given the similarity in results between the two options studied, it was decided to adopt
two-level stratification (option 1) in large cities where the sample consists of 300 or more
households, for the following reasons:


i) Contiguity in the primary strata gives us a suitable unit for sample updating. 
ii) The primary strata can be used for the formation of interviewer assignments. The size
of the strata was determined so that the sample within the geographic area, that is,
the area frame sample plus the sample for the apartment frame, corresponds to two
interviewer assignments (160 households in the city core and 120 elsewhere).


iii) Two-level stratification leads to better representation of the correlated response variance
in variance estimates. In the old sample design, there was usually only one interviewer
per stratum, resulting in an underestimate of this component of the variance. With
non-geographic secondary strata, but geographic interviewer assignments, this
problem will be less frequent.


The cost constraints associated with the computer time involved forced us to deal with
certain SRUs on an individual basis. In fact, the Montreal region was divided into seven
independent parts during stratification. The same was done with Toronto (5 parts),
Winnipeg (2 parts), Calgary (2 parts), Edmonton (2 parts) and Vancouver (3 parts). These
divisions were made on the basis of natural criteria as suggested by the geography of these regions.

In large SRUs, apartment buildings existing at the time the sample design was developed 
were sorted by the primary strata in which they were physically located in order to achieve 
an implicit stratification of this sample. 


Table 5
Comparison of Three Stratification Methods (SRUs)

                                   Old                 Two-level               Compact
                                                     Stratification         Stratification
Variables                                             (Option 1)             (Option 2)

                              1971       1981       1971       1981       1971       1981

Agriculture            [illegible]        2.9        3.2        1.8        3.4        1.8
Forestry                       2.2        2.3        2.1 [illegible] [illegible] [illegible]
Mines                         12.6        4.9        8.5        4.1        7.6        4.0
Manufacturing                 34.7       35.0       36.6       34.1       39.1       35.0
Construction                  32.5       29.6       39.7       30.1       42.4       33.4
Transportation                 9.2        6.8       18.0       11.6       20.0       11.6
Services                      29.5 [illegible]      45.8       33.1       46.7       32.1
Employed                      15.1        8.0       31.4       14.1       32.8       12.6
Unemployed (a)                14.6 [illegible]      14.9        6.7       15.5 [illegible]
Income                        39.4       38.6       51.8       29.8       53.6       48.0
Population 15-24               9.6       15.2       12.5       17.5       13.3       14.9
Population 55+                27.9       18.3       34.0       20.8       32.6       18.5
1-person households           20.3       19.2       36.3       33.8       37.8       35.0
2-person households           21.9       20.3       40.3       30.9       40.1       30.2
Owned dwellings               20.3       15.5       29.7       22.9       32.1       24.9
Secondary education           32.6       42.4       50.3       47.9 [illegible]      49.1
Population 15+ (a)            27.0        8.2       38.0       13.4       37.6       12.0
Dwellings (a)                 21.8       18.5       41.7       33.8       42.1       34.3

(a) Not used as a stratification variable.

In medium-sized SRUs, where the sample was not large enough to justify two-level
stratification, optimal strata were simply constructed by means of the stratification program,
without the application of geographic constraints.

The smallest SRUs, those not broken into block faces for census purposes, were manually
stratified, without any attempt at optimality.

Finally, we might note that the phase-in period of the new sample produced a further
constraint. For large SRUs, core areas were defined as consisting of complete old-design
strata that were unaffected by boundary changes. By having strata in the new design respect
these core areas, we ensured that during phase-in, the new sample in core areas represented
the same geographic area as the old, which permitted gradual replacement of the old sample
by the new without the need for a costly parallel build-up of new sample (Mayda, Drew and
Lindeyer 1985).


5. CONCLUSIONS 


Use of a multivariate clustering algorithm enabled us to develop a very general
stratification, thus strengthening the LFS in its role as a general household survey. In addition,
automation of the various stages of stratification in the NSR and SR parts, and delineation of the
PSUs in the NSRUs, led to a significant reduction in the cost and time required to redesign
the sample.

The system is documented (Foy 1984) and can be used for the stratification of other surveys. 
It may also be used in situations requiring the definition of statistical or administrative regions, 
using a full range of variables. 

For the LFS, one aspect requiring further research relates to the selection of contiguous 
or discontiguous strata, and the implications of discontiguous strata on sample design. 


ACKNOWLEDGEMENTS 


The authors would like to thank Sylvie Trudel and Marc Joncas for their assistance in 
carrying out the studies mentioned in this report, and the members of the LFS Sample Redesign 
Committee for their valuable suggestions. They are also grateful to the referee for his helpful 
comments. 


REFERENCES 


CHOUDHRY, G.H., LEE, H., and DREW, J.D. (1985). Cost-variance optimization for the
Canadian Labour Force Survey. Survey Methodology, 11, 33-50.


DAHMSTROM, P., and HAGNELL, M. (1978). The formation of strata using cluster analysis.
Internal document, Department of Statistics, University of Lund, Sweden.


FOY, P. (1984). Stratification program for the Canadian Labour Force Survey: User’s guide. Internal 
document, Census and Household Survey Methods Division, Statistics Canada. 


FRIEDMAN, H.P., and RUBIN, J. (1967). On some invariant criteria for grouping data. Journal 
of the American Statistical Association, 62, 1159-1178. 


JUDKINS, D.R., and SINGH, R.P. (1981). Using clustering algorithms to stratify primary sampling
units. American Statistical Association Proceedings of the Section on Survey Research Methods,
274-284.


KOSTANICH, D., JUDKINS, D.R., SINGH, R.P., and SCHANTZ, M. (1981). Modification of
Friedman-Rubin’s clustering algorithm for use in stratified PPS sampling. American Statistical
Association Proceedings of the Section on Survey Research Methods, 285-290.


MAYDA, F., DREW, J.D., and LINDEYER, J. (1985). Phase-in of the redesigned Labour Force Survey. 
Internal document, Census and Household Survey Methods Division, Statistics Canada. 

PLATEK, R., and SINGH, M.P. (1976). Methodology of the Canadian Labour Force Survey. Catalogue 
No. 71-526, Statistics Canada. 


SINGH, M.P., DREW, J.D., and CHOUDHRY, G.H. (1984). Post ’81 censal redesign of the
Canadian Labour Force Survey. Survey Methodology, 10, 127-140.


Survey Methodology, December 1985
Vol. 11, No. 2, pp. 111-118
Statistics Canada


Sampling Microfilmed Manuscript Census Returns 


D.R. BELLHOUSE¹


ABSTRACT 


In the first part of the paper a review of the historical literature concerning microfilmed manuscript 
census records is given. Several types of sampling designs have been used ranging in complexity from 
cluster and stratified random sampling to stratified two-stage cluster sampling. In the second part, 
a method is given to create a public use sample tape of the 1881 Census of Canada. This work was 
part of a pilot project for Public Archives of Canada and was carried out by the Social Science Com- 
puting Laboratory of the University of Western Ontario. The pilot project was designed to determine 
the merit and technical and economic feasibility of developing machine readable products from microfilm 
copies of the 1881 Census of Canada. 


KEY WORDS: Computerized random sampling; Microfilmed records; Multi-stage designs; Public use 
samples; Stratification. 


1. INTRODUCTION 


To write a history of any person or people the historian must rely on the applicable source 
material. Many historians today seek to write a history of the common man. In this area 
of historical research the source material may include items such as census returns, land 
records, and business directories. This paper focuses on the use of census returns as a source 
material. The major problem with using census data is that there is a large mass of it. For 
an historian with a reasonable research budget there is not enough money, time or
manpower to sift through all the census returns. The solution is to take a random sample of the
returns. Most census returns available to the historian are microfilm copies of the returns.
In Canada this includes the colonial censuses of 1841, 1851, and 1861 and the Census of
Canada for 1871 and 1881. The problem then becomes one of finding the appropriate design
to sample returns from the microfilm copies.

In section 2 of the paper a review of sampling techniques that have been used by historians
is given. The use of sampling techniques by historians has been very uneven. Some
applications have been very good; the use of a particular technique was well thought out and
applied. At the other end of the spectrum other historians appear to have used overly complex
designs when it was not necessary. A complex design could lead to design effects much
different from 1 which, in turn, could lead to problems in the analysis of the data. See, for
example, Rao and Scott (1981) and Holt et al. (1980) for discussions concerning categorical
data analysis and Scott and Holt (1982) for regression analysis. One other problem with many
of the surveys reviewed here is that there is insufficient discussion in the survey report to
ascertain the reasons why a particular design was chosen.

In section 3 of the paper a method is given to sample the returns of the 1881 Census of
Canada for the purpose of creating a public use sample tape. The work was carried out as
part of a project for Public Archives of Canada. The contract for the research was awarded
to the Social Science Computing Laboratory of the University of Western Ontario. A
description of the sampling design is given here; a complete report of the project is found in
Mitchell et al. (1982). In some ways the design is similar to the ones used for creating public
use sample tapes for the 1971 and 1976 Censuses of Canada. The sampling designs are all
based on stratification; however, in the case of the 1881 Census, stratification could only
be carried out on a geographical basis.


¹ D.R. Bellhouse, Department of Statistical and Actuarial Sciences, The University of Western Ontario, London,
Ontario, Canada N6A 5B9.


2. HISTORICAL REVIEW 


The sampling literature for historical census documents may be categorized by the type
of sampling method that was used. The order of categorization followed here will be in
approximately increasing complexity of the sampling design.


2.1 Cluster Sampling 


Ornstein and Darroch (1978) have given a simple, cost-efficient method of sampling and
linking census records over time. The heart of the scheme is to form clusters of surnames
and then to sample clusters. The clusters are defined by the first letter of the surname. If
the same clusters are sampled over various censuses then an individual who appears in more
than one census will be in the chosen sample each time. This reduces the number of cases to
be examined for linkage purposes and hence reduces the cost. This design is particularly
useful for historical studies of migration or historical changes over time.
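
The surname-cluster idea can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the function names `choose_letter_clusters` and `records_in_clusters`, and the toy records are hypothetical, and the clusters are drawn here with equal probability, which the review above does not specify.

```python
import random


def choose_letter_clusters(n_clusters, seed=None):
    """Draw a simple random sample of surname-initial clusters (letters A-Z)."""
    rng = random.Random(seed)
    letters = [chr(code) for code in range(ord("A"), ord("Z") + 1)]
    return set(rng.sample(letters, n_clusters))


def records_in_clusters(records, chosen_letters):
    """Keep every record whose surname begins with one of the chosen letters."""
    return [r for r in records if r["surname"][:1].upper() in chosen_letters]


# Hypothetical records from two censuses.
census_1861 = [{"surname": "Brown", "given": "Ann"},
               {"surname": "Martin", "given": "Jean"}]
census_1871 = [{"surname": "Brown", "given": "Ann"},
               {"surname": "Tremblay", "given": "Luc"}]

# The same letters are applied to every census, so a person present in both
# censuses is either in both samples or in neither.
chosen = choose_letter_clusters(n_clusters=5, seed=1881)
sample_1861 = records_in_clusters(census_1861, chosen)
sample_1871 = records_in_clusters(census_1871, chosen)
```

Because the set of letters is fixed once and reused, record linkage only has to be attempted within the sampled clusters.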


2.2 Stratified Sampling 


In all of the designs considered here that used stratification, no attempt was made to use 
optimal allocation. This was because prior knowledge of the variation within strata was not 
available to any of the researchers. To obtain the required information would have
increased the cost of each project substantially.

Hammarberg (1971) used a type of two-phase or double sampling technique in an attempt 
to decrease the bias incurred by sampling from an incomplete set of records. The records, 
sampled at the second phase, were business directories for nine counties in Indiana. In the 
first phase of sampling, he sampled from an assumed complete record set, the 1870 United 
States Census. The sampling method was stratified random sampling with proportional alloca- 
tion so that the sample is self-weighting. The strata were the nine counties. Two aspects of 
this study recur in subsequent historical sampling studies. The strata are geographical areas 
and the sample is self-weighting. 

Hammarberg (1971) also used the classical chi-square test of fit on certain variables to 
see how well his sample data fit known population distributions from the census reports. 
In many other studies no attempt was made to check the representativeness of the sample. 
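
As a reminder of what such a check involves, the following sketch computes the classical chi-square statistic for comparing sample counts with known population proportions. The counts and proportions are invented for illustration, and the comparison with a tabulated critical value assumes simple random sampling; it is not a reconstruction of Hammarberg's calculations.

```python
def chi_square_statistic(observed_counts, population_proportions):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    n = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, population_proportions):
        expected = n * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical sample counts in three categories and the corresponding
# proportions published in the census reports.
observed = [48, 35, 17]
census_proportions = [0.50, 0.30, 0.20]

statistic = chi_square_statistic(observed, census_proportions)

# With 3 categories there are 2 degrees of freedom; the 5% critical value
# of the chi-square distribution with 2 degrees of freedom is about 5.99.
fits_at_5_percent = statistic < 5.99
```

For a sample drawn under a complex design, the statistic would need an adjustment for the design effect before being referred to the chi-square tables.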

Soltow (1975) used samples from the 1850, 1860 and 1870 United States Censuses to study
wealth in the United States. For each census year he selected a sample from each microfilm
reel so that the sample is stratified by reels, an approximate geographical stratification.
Soltow’s design appears to be a type of systematic sampling. To choose a sample he designated
a spot on the screen of the microfilm reader and fed the film through the reader. The feeder
arm was given successive half-turns until the manuscript census entry at the designated spot
on the screen was acceptable. One criterion for unit selection was that the entry had
to be a male aged twenty years or older. Also, persons “with wealth of $100,000 or more were
sampled 40 times more heavily in 1860 than those under $100,000” (p. 5) so that the design
is not self-weighting. Although it is not stated, the oversampling of wealthy people appears
to have been done in order to obtain a reasonable number of them for comparison to the
less affluent sections of society. Soltow (1975) also compared his sample results to the
published distributions but made no statistical tests for goodness-of-fit. He found that the
sample data conformed well to the census results in terms of averages and proportions on
various variables. This was true even for variables such as mean wealth, a result which is
surprising in view of the oversampling of the more wealthy individuals and since his estimate
appears to be the sample average.
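
The reason the result is surprising can be seen from a small numerical sketch. The wealth values and the base sampling rate below are invented; only the factor of 40 comes from the quotation above. The weighted estimator shown is the usual inverse-selection-rate (ratio-type) mean, offered as an assumption about what a design-based estimate would look like, not as Soltow's method.

```python
def unweighted_mean(values):
    """Plain sample average, ignoring the sampling design."""
    return sum(values) / len(values)


def weighted_mean(values, selection_rates):
    """Mean weighted by the inverse of each unit's relative selection rate."""
    weights = [1.0 / rate for rate in selection_rates]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical sample: three men of modest wealth and two with $100,000 or more.
base_rate = 0.001                      # assumed rate for those under $100,000
wealth = [500, 1200, 800, 150000, 200000]
rates = [base_rate] * 3 + [40 * base_rate] * 2   # wealthy sampled 40 times more heavily

naive = unweighted_mean(wealth)
design_based = weighted_mean(wealth, rates)
print(naive, design_based)
```

In this toy sample the plain average is dominated by the oversampled wealthy men, whereas the weighted mean gives each of them one fortieth of the weight of the others.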

In studying the relationship between ethnicity and occupation, Darroch and Ornstein (1980) 
used a sample of the 1871 Census of Canada. A description of the sampling method is given 
in Ornstein (1978). For the purposes of both studies it was necessary to oversample some 
ethnic groups so that the design used was not self-weighting. On ignoring the oversampling 
of certain ethnic groups, the sampling method used was stratified random sampling. The 
stratification is based on the geographical hierarchical structure of the census records:
provinces, districts within provinces, sub-districts, and divisions within sub-districts. The
division corresponds to the modern enumeration area. The natural stratification variable seems
to be divisions. However, Ornstein (1978) further subdivided divisions into smaller groups 
which comprise the strata and then sampled two households per stratum. How the further 
subdivision was made is not given, but Ornstein states that the reason for further stratifica- 
tion is that sampling two units per stratum minimizes the variance of estimates of certain 
population values. Although it is not stated, it appears that Ornstein (1978) was trying to 
increase the efficiency of stratification by forming strata within a division as homogeneous 
as possible. By stratifying in this way the cost to sample was increased. One other aspect 
of Ornstein’s (1978) method is that it was necessary to make at least two passes through 
the microfilms, the first to obtain the number of households per division and the second 
to sample the household. 

Johnson (1978b) and Graham (1980) obtained a public use sample of the United States 
Census of 1900. Johnson (1978a) has described some related work in sampling the 1860 Rhode 
Island Census schedules. The sample was chosen by obtaining random lines on the microfilm, 
and then by searching for the chosen lines using a microfilm reader with an odometer
attachment. Because of the sample selection procedure, the overall sample size is random. A
number of criteria are given in Graham (1980, p. 41) for including or excluding sampled 
lines. The sampling scheme is stratified random sampling with microfilm reels as strata. The 
stratification is geographically based provided that the contiguous census returns are all 
grouped in the same microfilm. The advantages of this scheme are that it is operationally 
efficient and only one pass through the microfilm is needed. Also, it avoids the problem 
of empty strata or one unit per stratum when the sampling fraction within a stratum is small. 
One disadvantage is that, since one pass through the data is made, potentially major
problems that arise must be dealt with on an ad hoc basis.


2.3 Stratified Cluster Sampling 


Bateman and Foust (1974) obtained a sample of farms in the northern United States from
the 1860 United States Census. The north was divided into two strata, East and West, and
a random sample of rural counties was chosen in each stratum. Within a county one rural
township (the cluster) was chosen at random and information was collected on every farm
in the township. One reason for clustering appears to be due to cost considerations. The
farms were obtained from the census of agriculture schedules and demographic information
on the owners or operators was obtained from the census of population schedules. By
remaining in the same township the work of matching farms to owners is minimized. Swierenga
(1983) has provided a second reason for cluster sampling. He states that township data made
it possible to estimate total factor productivity in agriculture and to identify the entire
agricultural workforce, including farm laborers not residing in the 12,000 farms included
in the sample (p. 793). Since the clusters, townships, were not chosen by probability
proportional to size the design was not self-weighting.
Bateman and Foust (1974) also used some tests to check the representativeness of their
sample. As in Hammarberg (1971), they applied the chi-square test of fit to compare sample
counts to expected population counts. For continuous variables they used the t-test. The
estimates of the mean and variance were the simple estimates, not based on the sampling
design.


2.4 Stratified Two-Stage Sampling 


Hammarberg (1977) used a stratified two-stage sampling scheme to sample households
in the 1880 census for Utah Territory. The strata are a fairly complicated amalgamation of
five geographical regions in Utah, some counties within populous regions and some large
towns. Within each stratum, a sample of towns or wards was chosen. Towns which were
already strata were included with certainty. Wards are geographical divisions in the Mormon
Church similar to parishes in the medieval Christian church. Then a sample of households
was taken from the chosen towns or wards. The sample was self-weighting on the household.
The rationale for stratifying on geographical areas, given on page 460, is compelling:


    “Because the fundamental organization of the mass of people was conceived
    geographically, and most institutional records, - both church and secular - were
    organized to correspond to these areal definitions, a sample of the population
    on an area-by-area basis is also, in large measure, a sample of the records
    produced and organized for the population.”


McInnis (1977) also used a stratified two-stage sampling design to obtain a sample from
the 1861 Canadian Census. He studied the relationship between the number of children per
family and the abundance of land in certain areas. He first stratified approximately 300
townships by their dates of settlement. Then he took a sample of townships within strata
and samples of farms within townships. His reason for choosing a two-stage sample appears
related to cost. A sampled farm was matched to the entry in the agricultural census. It takes
less time and hence costs less to sample a few townships and match records for several farms
within a township than to stratify on townships and match records for a small number
of farms in each. The same argument applies to Hammarberg’s (1977) work, since he was also
linking other records to the sampled households.


2.5 Stratified Two-Stage Cluster Sampling 


Smith (1978) used a stratified two-stage cluster sampling scheme to study older Americans
in the 1900 United States Census. The strata are described as census regions with the
counties within these regions as the primary sampling units. The primary sampling units,
counties, are chosen with probability proportional to the size of their population. Within a
county, several pages of census returns were sampled. Every individual over the age of 50 on
each sampled page was recorded. Cluster sampling was necessary since it was too expensive
to identify every individual eligible to be sampled. There is also an attempt to compare some
sample distributions to the published census results. The statistic used is the standard test
statistic for hypotheses on a single proportion although the data are multinomial.
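
Selection with probability proportional to size can be sketched as follows. The review does not say whether Smith drew counties with or without replacement; the sketch assumes draws with replacement because that is the simplest case, and the county names and populations are hypothetical. It relies on `random.choices` from the Python standard library, which draws with replacement according to the supplied weights.

```python
import random


def pps_with_replacement(sizes, n_draws, seed=None):
    """Draw units with probability proportional to size, with replacement.

    `sizes` maps each primary sampling unit to its size measure
    (here, a county's population).
    """
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=n_draws)


# Hypothetical counties in one census region.
county_population = {"Adams": 12000, "Clark": 48000, "Lee": 6000, "Wayne": 30000}
selected = pps_with_replacement(county_population, n_draws=2, seed=1900)
print(selected)
```

A county may be drawn more than once under this scheme; a without-replacement PPS design would need a different selection procedure.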

A second stratified two-stage cluster sample known as the Parker-Gallman sample is
described in Foust (1968, ch. 2). This sample was drawn from the 1860 Census of the United
States to study the cotton growing regions in the South. The strata were 405 Southern cotton
counties, those counties which produced 1,000 or more 400-pound bales of cotton in the
year preceding Census day. Within a county a systematic random sample of pages from
the manuscript census was chosen; within a selected page a block of five farms was chosen
at random, the block being the cluster. Cluster sampling was used because information on
a particular farm had to be accumulated from three different census schedules. The
matching of the farms in the schedules was described as very laborious. Fogel and Engerman
(1974, pp. 22-25) have listed several additional samples related to the Parker-Gallman
sample. Bode and Ginter (1984) have criticized the content of the sample.

Of the large number of samples reviewed here, the Parker-Gallman sample and the samples 
drawn by Bateman and Foust (1974) are the two that have been most extensively studied. 
Swierenga (1983) has reviewed much of the work based on these samples. 


3. PUBLIC USE SAMPLES FROM THE 1881 CENSUS OF CANADA 


Early in the 1980’s Public Archives of Canada obtained Schedule 1: Nominal Return of
the Living for the 1881 Census of Canada. The returns were microfilmed and copies are
currently available in most academic and many public libraries. After producing the microfilm
copies, Public Archives of Canada was interested in producing a machine readable
edition of the entire census and/or machine readable public use samples similar to the public
use samples for the censuses of 1971 and 1976 (see Statistics Canada (1975, 1979) for
documentation). The Social Science Computing Laboratory of the University of Western Ontario
obtained a contract to perform a feasibility study and the author was asked to design a
sampling scheme to construct the public use sample. In this section the proposed design is
described. A report of the feasibility study is found in Mitchell et al. (1982).

Schedule 1 contains information on each individual: age, sex, country of birth, ethnic
origin, occupation, marital status, and whether or not the person had certain disabilities. The
other seven schedules contain information on industry, agriculture, forestry, fishing, and
mining. A brief description is found in Census of Canada 1880-81 Vol. 1, pp. v-xv.

The basic requirements of the public use samples are briefly described. To conform to
the 1971 and 1976 public use samples it would be necessary to have two independent samples,
one of households and one of individuals. If production of only one sample is economically
feasible, however, the first priority is the household sample. The public use sample of the
1900 Census of the United States, described by Johnson (1978b) and Graham (1980), is a
sample of households. Moreover, the household appears to be the most important sampling
unit desired by historians. Taking another cue from the sample of the 1900 census, a sample
size in the order of one hundred thousand individuals for either the individual or the
household sample is desirable. For the 1881 Census of Canada this would result in an
approximate 2½% sampling fraction in either sample. Finally, a stratified sampling design with
proportional allocation and with geographical areas as strata is desirable for both samples. This
conforms to sampling practice so far in the historical literature and ensures a self-weighting
design. Within a stratum the units should be chosen by simple random rather than systematic
sampling. Although systematic sampling is convenient, Johnson (1978a) has maintained that it
is not appropriate for manuscript census schedules. Neighbours possess similar characteristics
and would never be included together in a systematic sample. Historians may be interested
in studying those individuals with like characteristics.
Based on these basic requirements the following sampling scheme was proposed for the
household sample. The design suggested was stratified random sampling with census divisions
(the modern enumeration area) as strata, similar to Ornstein (1978), rather than microfilm
reels as used by Johnson (1978b) and Graham (1980). The census divisions provide natural
geographical strata. In addition, the households are consecutively numbered on the
enumerators’ lists with twenty-five individuals per census manuscript page. Thus, if one
preliminary pass is made through the microfilms the number of households in each stratum
could be easily obtained. With a 2-2.5% sampling fraction and proportional allocation, sample
sizes smaller than two households are obtained in divisions (strata) with fewer than
approximately one hundred households. In these cases the division should be grouped with
geographically contiguous strata. Further stratification beyond the division as in Ornstein
(1978) seems unnecessary and would substantially add to the sampling costs.
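
The arithmetic behind the grouping rule can be sketched as follows. The division identifiers and household counts are hypothetical, and rounding the allocation to the nearest whole household is an assumption; the text specifies only proportional allocation and a minimum of two households per stratum.

```python
def proportional_allocation(division_sizes, sampling_fraction):
    """Sample size per division: division size times the overall sampling fraction."""
    return {d: round(n * sampling_fraction) for d, n in division_sizes.items()}


def divisions_to_group(division_sizes, sampling_fraction, minimum=2):
    """Divisions whose expected sample size falls below the minimum."""
    return [d for d, n in division_sizes.items() if n * sampling_fraction < minimum]


# Hypothetical household counts from the preliminary pass through the microfilm.
households = {"D1": 400, "D2": 40, "D3": 1000}
fraction = 0.025

allocation = proportional_allocation(households, fraction)
too_small = divisions_to_group(households, fraction)
print(allocation, too_small)
```

At a 2.5% fraction the threshold is 2 / 0.025 = 80 households and at 2% it is 100, which matches the "approximately one hundred households" quoted above; flagged divisions would then be merged with a contiguous neighbour before sampling.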

The sampling process can easily be made part of a computing environment. From the 
point of view of a coder sitting at a computer terminal with a microfilm reader to one side 
the sampling process is straightforward. When a coder is sampling a division, he merely presses 
the appropriate keys identifying the division he wants and the number of the first household 
to be sampled appears on the terminal screen. The coder then moves the microfilm forward 
to the appropriate household number. Once the data are entered, a next key is pressed and 
the second household number appears. When the final sampled household from that divi- 
sion is obtained, pressing the next key will result in an instruction to pick another division 
to sample. In some situations there may be missing households. For example, one or more 
of the enumerators' sheets containing 25 names may have been lost. In this case, when a 
coder, in the process of sampling, encounters a missing household, the household is entered 
as missing, along with any other missing household numbers that the coder may notice. The 
coder then continues sampling to the end of the division. Since at least one sampled 
household was missing, the coder is instructed to rewind the microfilm and to continue 
sampling in the division. The main feature for the coder in this set-up is that, with the 
exception of missing data situations, the coder need only move the microfilm reel forward. 

The computing algorithm behind this sampling method utilizes a file containing informa- 
tion about the divisions or division groupings and Bebbington’s (1975) algorithm for draw- 
ing a simple random sample without replacement. After the initial pass is made through the 
microfilms a file is created containing the division identifier and the number of households 
in the division. If the divisions have been grouped then the size of each is recorded. When 
a coder identifies a division to be sampled the appropriate file entry is examined and the 
division size is obtained. The required sample size in the division is the division size times 
the sampling fraction for the whole survey which yields proportional allocation. Then Bebb- 
ington’s (1975) algorithm makes a sequential choice of sample units from an ordered list, 
the list here being the ordered household numbers in a division. Each household number 
is examined in turn and is selected for or rejected from the sample. When a household number 
is selected the number is printed to the terminal screen and the selection procedure pauses 
for data entry. The sample numbers selected will be in increasing order so that a forward 
search only is necessary on the microfilm. 
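
The sequential selection described above can be sketched as follows. This is a minimal illustration, not the program used in the study: it assumes the usual sequential rule in which each unit is selected with probability (units still needed)/(units still available), and the function names and the rounding rule for the sample size are hypothetical.

```python
import random


def required_sample_size(division_size, sampling_fraction):
    """Proportional allocation: division size times the overall sampling fraction.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    return round(division_size * sampling_fraction)


def sequential_sample(population_size, sample_size, rng=None):
    """Yield sample_size household numbers from 1..population_size in increasing order.

    Each household number is examined in turn and is either selected or rejected,
    so only a forward search is needed on the ordered list.
    """
    rng = rng or random.Random()
    still_needed = sample_size
    for unit in range(1, population_size + 1):
        if still_needed == 0:
            break
        still_available = population_size - unit + 1
        # Select with probability still_needed / still_available.
        if rng.random() * still_available < still_needed:
            still_needed -= 1
            yield unit


# Example: a division of 120 households sampled at 2.5%.
n = required_sample_size(120, 0.025)
for household in sequential_sample(120, n, random.Random(7)):
    print(household)  # the coder would move the microfilm forward to this number
```

Because the generator pauses after each selected number, it mirrors the set-up in which the selection procedure waits for data entry before the next household number is produced.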

Sampling collapsed strata or grouped divisions can also be done using this algorithm. 
Suppose $L$ strata of sizes $N_1, \ldots, N_L$ have been grouped into one stratum of size 
$N = N_1 + N_2 + \ldots + N_L$. It is necessary only to use the stratum sizes to obtain the 
sampled household in each stratum. Suppose in the algorithm units $s(1), \ldots, s(m)$ have 
been chosen for the sample, with $s(m) \le N$. If for any $i = 1, \ldots, m$, 
$N_1 + \ldots + N_{h-1} < s(i) \le N_1 + \ldots + N_{h-1} + N_h$, where 
$N_1 + \ldots + N_{h-1} = 0$ for $h = 1$, the unit $s(i)$ is in stratum $h$ and the household 
number within that stratum is $s(i) - (N_1 + \ldots + N_{h-1})$. 
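
The mapping from a unit number in the grouped stratum to an original stratum and a household number within it can be sketched as below. The function name and the example sizes are hypothetical and serve only to illustrate the cumulative-sum rule.

```python
def locate_in_grouped_strata(unit, stratum_sizes):
    """Return (h, within) for a 1-based unit number in a grouped stratum.

    h is the 1-based index of the original stratum and within is the
    household number inside that stratum.
    """
    lower = 0  # N_1 + ... + N_{h-1}, taken as 0 for h = 1
    for h, size in enumerate(stratum_sizes, start=1):
        if lower < unit <= lower + size:
            return h, unit - lower
        lower += size
    raise ValueError("unit number is outside the grouped stratum")


# Three small divisions of 40, 35 and 60 households grouped into one stratum.
sizes = [40, 35, 60]
print(locate_in_grouped_strata(41, sizes))   # first household of the second division
print(locate_in_grouped_strata(135, sizes))  # last household of the third division
```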

The general sampling algorithm can also be modified to account for missing households. 
The method described does not require enumerating these missing households prior to sampl- 
ing. When the sampling of a stratum by Bebbington’s (1975) algorithm has been completed, 
two possibilities arise: no missing households were encountered or some were encountered. 
In the former situation, there is no problem; the sampling has been completed for that situa- 
tion. In the second situation, the achieved sample size, say m, is less than the desired size 
n. To obtain a sample of size n of the existing households, it is necessary to sample n — m 
additional households. To achieve this, the sampling process for this stratum is started again 
but a list is created of the sampled and known missing households. Suppose there are $M$ 
previously sampled and known missing households ($M \ge n$: a coder may notice and record 
households that are missing other than those which were chosen for the sample). Define an 
$N$-dimensional vector $v$ where the value of the $u$th entry is $v(u) = u$ for 
$u = 1, \ldots, N$. The $u$th entry is a pointer to the $u$th household in a division. Now 
delete all entries in $v$ corresponding to households on the microfilm which are missing or 
previously sampled and collapse the vector into an $(N - M)$-dimensional vector $w$. The 
values $w(u)$, $u = 1, \ldots, N - M$ will contain the household numbers left to sample. In 
the algorithm, it is necessary only to restate the population size as $N - M$ and the sample 
size as $n - m$. 
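
The second pass over a stratum with missing households might be sketched as follows. The sketch assumes the same sequential selection rule (select with probability needed/available) applied to the collapsed list of remaining household numbers; the function names are hypothetical.

```python
import random


def remaining_households(population_size, excluded):
    """Build the collapsed vector w: numbers 1..N minus sampled or known-missing ones."""
    excluded = set(excluded)
    return [u for u in range(1, population_size + 1) if u not in excluded]


def top_up_sample(population_size, excluded, shortfall, rng=None):
    """Draw the shortfall (n - m) additional households from the N - M left.

    The result is in increasing order, so after rewinding the microfilm
    the coder again only moves forward.
    """
    rng = rng or random.Random()
    w = remaining_households(population_size, excluded)
    needed = shortfall
    chosen = []
    for position, household in enumerate(w):
        if needed == 0:
            break
        available = len(w) - position
        if rng.random() * available < needed:
            chosen.append(household)
            needed -= 1
    return chosen


# N = 20, desired n = 4; households 3, 7, 8 and 15 were sampled, 7 and 15 turned
# out to be missing, and the coder also noticed that 9 is missing. So M = 5 and
# the achieved sample size is m = 2.
print(top_up_sample(20, {3, 7, 8, 9, 15}, shortfall=2, rng=random.Random(3)))
```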
A separate and independent sample of individuals can be easily obtained using the 
household method of sample selection with slight modifications. The key to the 
modifications is that the pages of the enumerators' lists are numbered with 25 names to a 
page. In the first pass through the microfilms, it is necessary to find the final page number 
and the number of lines on the last page of each division. On applying Bebbington's algorithm 
the computer will print the page and line number of the individual sampled. 
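
As a small illustration of the page and line arithmetic, the following sketch converts a sampled individual's sequence number into a page and line. It assumes, for illustration only, that pages within a division are numbered from 1 and that every page except the last is full; the function names are hypothetical.

```python
NAMES_PER_PAGE = 25


def individuals_in_division(final_page, lines_on_final_page):
    """Population size of a division from its final page number and last-page line count."""
    return (final_page - 1) * NAMES_PER_PAGE + lines_on_final_page


def page_and_line(individual):
    """Map a 1-based individual number to a (page, line) pair, both 1-based."""
    page, line = divmod(individual - 1, NAMES_PER_PAGE)
    return page + 1, line + 1


size = individuals_in_division(final_page=12, lines_on_final_page=7)
print(size)                 # number of individuals available for sampling
print(page_and_line(size))  # position of the last individual in the division
```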
This method of sample selection has been programmed and tested by the Social Science 
Computing Laboratory with positive results. For example, in the feasibility study the 
percentage of time spent searching for sampled units represented approximately 6% of the 
total estimated data entry time for the household sample and 18.5% for the individual sample. 
See Mitchell et al. (1982 pp. 20-21). 
Estimation of Total for Two Characters in 
Multiple Frame Surveys 


B.C. SAXENA, P. NARAIN, and A.K. SRIVASTAVA¹ 


ABSTRACT 


In this paper estimation of multiple characters in multiple frame surveys has been investigated. The 
gain due to two character study in a common survey, over separate surveys for individual characters, 
has been obtained. Cost comparison is also made between two character multi frame survey and two 
character single frame survey. 


KEY WORDS: Multi-character survey; Post-stratified estimate; Optimization; Cost comparison. 


1. INTRODUCTION 


The technique of multiple frame surveys was suggested by Hartley (1962) and subsequently 
discussed by Lund (1968), Hartley (1974), Vogel (1975), Armstrong (1979), etc. Lund 
suggested an alternative to Hartley's estimator utilizing the actual division in the sample 
among various domains. Hartley (1974) further considered the problem with a more general 
approach applicable to various sampling designs. He observed that most potential multiple 
frame situations employed different types of units in their respective frames. Bosecker and 
Ford (1976) extended Hartley's estimator to take advantage of stratification within the 
overlap domain. Serrurier and Phillips (1976) and Armstrong (1978) tested multiple frame 
techniques in agricultural surveys. The utility of multiple frame surveys has been 
demonstrated in a wide variety of situations. In sample surveys, interest sometimes lies not 
only in the estimation of a single character; several characters are required to be studied 
simultaneously. For a proper utilization of resources this is often achieved through 
integrated surveys. For instance, a single survey may be planned to estimate the production 
of several vegetable crops. Also, besides the frame of all vegetable growers, another 
incomplete but relatively easily accessible frame of important vegetable growers may be 
utilized. In this paper, the estimation of total for two characters in multiple frame surveys 
has been considered. The advantage of studying more than one character in a single survey 
over the situation when independent surveys are planned for individual characters in a 
multiple frame situation is also investigated. 


2. ESTIMATOR 


Let there be two overlapping frames A and B of sizes $N_A$ and $N_B$ respectively. In 
multiple frame surveys two samples of sizes $n_A$ and $n_B$ are selected independently by 
simple random sampling from frames A and B respectively. The overlapping frames generate 
domains a, b and ab defined as follows: 

a: Consisting of units belonging to frame A only, 
b: Consisting of units belonging to frame B only, 
ab: Units belonging to both A and B frames. 
The sample sizes $n_A$ and $n_B$ are split into sizes $n_a$, $n_{ab}$ and $n_b$, $n_{ba}$ 
such that $n_a$ and $n_{ab}$ are the number of units out of $n_A$ units belonging to domains 
a and ab respectively. Similarly $n_b$ and $n_{ba}$ are the split of $n_B$ units belonging 
to domains b and ab respectively. In the multi-character study, there will be further split 
of these domains generating sub-domains as follows: 

Let there be two characters $y_{(1)}$ and $y_{(2)}$ under study. Then each of the usual 
domains a, ab and b are further subdivided as a(1), a(12), a(2), ab(1), ab(12), ab(2) and 
b(1), b(12), b(2) respectively. Here, a(1), a(12) and a(2) are the sub-domains consisting of 
units having character $y_{(1)}$ only, both $y_{(1)}$ and $y_{(2)}$, and $y_{(2)}$ only 
respectively in domain a. Similar explanation holds for other sub-domains ab(1), ab(12) etc. 
Thus the sample split in two character study will be as follows: 


$$n_A = n_a + n_{ab}$$

where 

$$n_a = n_{a(1)} + n_{a(12)} + n_{a(2)} \quad \text{and} \quad n_{ab} = n_{ab(1)} + n_{ab(12)} + n_{ab(2)},$$

and 

$$n_B = n_b + n_{ba}$$

where 

$$n_b = n_{b(1)} + n_{b(12)} + n_{b(2)} \quad \text{and} \quad n_{ba} = n_{ba(1)} + n_{ba(12)} + n_{ba(2)}.$$

Here $n_{a(1)}$, $n_{a(2)}$, etc. are the split of $n_A$ units belonging to sub-domains a(1), 
a(2), etc. If we confine to one character then define 

$$n_{A(1)} = n_{a(1)} + n_{a(12)} + n_{ab(1)} + n_{ab(12)},$$
$$n_{B(1)} = n_{b(1)} + n_{b(12)} + n_{ba(1)} + n_{ba(12)}.$$

Similarly, for the second character, $n_{A(2)}$ and $n_{B(2)}$ are defined. The estimate of 
the total for the first character is given by 


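$$\hat{Y}^{(1)} = \hat{Y}_{a(1)} + \hat{Y}^{(1)}_{a(12)} + p_1 \hat{Y}_{ab(1)} + q_1 \hat{Y}_{ba(1)} + p_2 \hat{Y}^{(1)}_{ab(12)} + q_2 \hat{Y}^{(1)}_{ba(12)} + \hat{Y}^{(1)}_{b(12)} + \hat{Y}_{b(1)} \tag{1}$$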


where $\hat{Y}_{a(1)}$, $\hat{Y}^{(1)}_{a(12)}$, etc. are the estimated totals for character 
$y_{(1)}$ of the respective sub-domains. In the subsequent discussion, for the domains in 
which both the characters are available, the superscript corresponds to the character under 
consideration. For the domains having only one character the superscript is not used since 
the domain evidently corresponds to the character. 

Also, $p_1 + q_1 = 1$ and $p_2 + q_2 = 1$. Define $\bar{y}_{a(1)}$, $\bar{y}^{(1)}_{a(12)}$, 
etc. as the sample means for respective sub-domains for character $y_{(1)}$ and $y_{(2)}$ 
respectively. 
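Thus, 

$$\hat{Y}^{(1)} = N_{a(1)} \bar{y}_{a(1)} + N_{a(12)} \bar{y}^{(1)}_{a(12)} + N_{ab(1)} (p_1 \bar{y}_{ab(1)} + q_1 \bar{y}_{ba(1)}) + N_{ab(12)} (p_2 \bar{y}^{(1)}_{ab(12)} + q_2 \bar{y}^{(1)}_{ba(12)}) + N_{b(12)} \bar{y}^{(1)}_{b(12)} + N_{b(1)} \bar{y}_{b(1)}. \tag{2}$$

Similarly for the second character, we have 

$$\hat{Y}^{(2)} = N_{a(2)} \bar{y}_{a(2)} + N_{a(12)} \bar{y}^{(2)}_{a(12)} + N_{ab(2)} (p_3 \bar{y}_{ab(2)} + q_3 \bar{y}_{ba(2)}) + N_{ab(12)} (p_4 \bar{y}^{(2)}_{ab(12)} + q_4 \bar{y}^{(2)}_{ba(12)}) + N_{b(12)} \bar{y}^{(2)}_{b(12)} + N_{b(2)} \bar{y}_{b(2)} \tag{3}$$

where $p_3 + q_3 = 1$ and $p_4 + q_4 = 1$. 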
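
As a numerical illustration of the post-stratified estimator (2) for the first character, the following sketch combines sub-domain sizes and sample means with the weights $p_1$ and $p_2$. The dictionary keys, function name, and all numbers are hypothetical and chosen only to show the arithmetic.

```python
def estimate_total_char1(N, ybar, p1, p2):
    """Post-stratified estimate of the total of the first character.

    N maps sub-domain labels to population sizes; ybar maps sample labels to
    sample means. Labels "ab(..)" refer to overlap units sampled from frame A
    and "ba(..)" to overlap units sampled from frame B.
    """
    q1, q2 = 1.0 - p1, 1.0 - p2
    return (
        N["a(1)"] * ybar["a(1)"]
        + N["a(12)"] * ybar["a(12)"]
        + N["ab(1)"] * (p1 * ybar["ab(1)"] + q1 * ybar["ba(1)"])
        + N["ab(12)"] * (p2 * ybar["ab(12)"] + q2 * ybar["ba(12)"])
        + N["b(12)"] * ybar["b(12)"]
        + N["b(1)"] * ybar["b(1)"]
    )


# Hypothetical sub-domain sizes and sample means for the first character.
N = {"a(1)": 100, "a(12)": 50, "ab(1)": 40, "ab(12)": 20, "b(12)": 10, "b(1)": 30}
ybar = {"a(1)": 2.0, "a(12)": 3.0, "ab(1)": 4.0, "ba(1)": 6.0,
        "ab(12)": 5.0, "ba(12)": 7.0, "b(12)": 1.0, "b(1)": 2.0}
print(estimate_total_char1(N, ybar, p1=0.5, p2=0.25))
```

The estimate for the second character follows the same pattern with the sub-domains a(2), ab(2), b(2) and the weights $p_3$, $p_4$.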


2.1 Variance of the Estimator 


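The conditional variance of the post-stratified estimates $\hat{Y}^{(1)}$, $\hat{Y}^{(2)}$ 
for given sub-domain sample sizes ignoring the finite population correction may be written as 

$$
\begin{aligned}
V(\hat{Y}^{(1)} \mid n_{a(1)}, n_{a(12)}, \text{etc.}) ={}& N^2_{a(1)} \frac{\sigma^2_{a(1)}}{n_{a(1)}} + N^2_{a(12)} \frac{\sigma^{(1)2}_{a(12)}}{n_{a(12)}} \\
&+ N^2_{ab(1)} \left( p_1^2 \frac{\sigma^2_{ab(1)}}{n_{ab(1)}} + q_1^2 \frac{\sigma^2_{ab(1)}}{n_{ba(1)}} \right) \\
&+ N^2_{ab(12)} \left( p_2^2 \frac{\sigma^{(1)2}_{ab(12)}}{n_{ab(12)}} + q_2^2 \frac{\sigma^{(1)2}_{ab(12)}}{n_{ba(12)}} \right) \\
&+ N^2_{b(1)} \frac{\sigma^2_{b(1)}}{n_{b(1)}} + N^2_{b(12)} \frac{\sigma^{(1)2}_{b(12)}}{n_{b(12)}}.
\end{aligned} \tag{4}
$$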


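The unconditional variance of $\hat{Y}^{(1)}$ is approximately given by 

$$
\begin{aligned}
V(\hat{Y}^{(1)}) ={}& \frac{N_A}{n_A} \left\{ N_{a(1)} \sigma^2_{a(1)} + N_{a(12)} \sigma^{(1)2}_{a(12)} + p_1^2 N_{ab(1)} \sigma^2_{ab(1)} + p_2^2 N_{ab(12)} \sigma^{(1)2}_{ab(12)} \right\} \\
&+ \frac{N_B}{n_B} \left\{ N_{b(1)} \sigma^2_{b(1)} + N_{b(12)} \sigma^{(1)2}_{b(12)} + q_1^2 N_{ab(1)} \sigma^2_{ab(1)} + q_2^2 N_{ab(12)} \sigma^{(1)2}_{ab(12)} \right\}
\end{aligned} \tag{5}
$$

which is equal to the variance for stratified sampling with proportional allocation. 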
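
The approximate unconditional variance (5) can be evaluated numerically as in the sketch below. The function name, dictionary keys, and numbers are hypothetical; the variances are the within-sub-domain variances of the first character.

```python
def unconditional_variance_char1(N_A, n_A, N_B, n_B, N, var, p1, p2):
    """Approximate variance of the estimated total of the first character.

    N holds sub-domain population sizes and var the corresponding variances,
    keyed by the labels a(1), a(12), ab(1), ab(12), b(1), b(12).
    """
    q1, q2 = 1.0 - p1, 1.0 - p2
    frame_a = (
        N["a(1)"] * var["a(1)"]
        + N["a(12)"] * var["a(12)"]
        + p1 ** 2 * N["ab(1)"] * var["ab(1)"]
        + p2 ** 2 * N["ab(12)"] * var["ab(12)"]
    )
    frame_b = (
        N["b(1)"] * var["b(1)"]
        + N["b(12)"] * var["b(12)"]
        + q1 ** 2 * N["ab(1)"] * var["ab(1)"]
        + q2 ** 2 * N["ab(12)"] * var["ab(12)"]
    )
    return (N_A / n_A) * frame_a + (N_B / n_B) * frame_b


N = {"a(1)": 100, "a(12)": 50, "ab(1)": 40, "ab(12)": 20, "b(1)": 30, "b(12)": 10}
var = {label: 1.0 for label in N}  # unit variances, purely for illustration
print(unconditional_variance_char1(1000, 100, 500, 50, N, var, p1=0.5, p2=0.5))
```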


Similarly, 

$$
\begin{aligned}
V(\hat{Y}^{(2)}) ={}& \frac{N_A}{n_A} \left\{ N_{a(2)} \sigma^2_{a(2)} + N_{a(12)} \sigma^{(2)2}_{a(12)} + p_3^2 N_{ab(2)} \sigma^2_{ab(2)} + p_4^2 N_{ab(12)} \sigma^{(2)2}_{ab(12)} \right\} \\
&+ \frac{N_B}{n_B} \left\{ N_{b(2)} \sigma^2_{b(2)} + N_{b(12)} \sigma^{(2)2}_{b(12)} + q_3^2 N_{ab(2)} \sigma^2_{ab(2)} + q_4^2 N_{ab(12)} \sigma^{(2)2}_{ab(12)} \right\}
\end{aligned} \tag{6}
$$

where $\sigma^2_{a(1)}$, $\sigma^{(1)2}_{a(12)}$, etc. are the variances for the two characters 
in the respective sub-domains. 
For optimization of $p_i$'s ($i = 1, 2, 3, 4$) for a common survey a combination of 
individual variances needs to be minimized subject to the fixed total cost for the combined 
survey. Consider the simplest linear combination 

$$F = V(\hat{Y}^{(1)}) + V(\hat{Y}^{(2)}).$$


For the common survey, a suitable cost function may be considered as follows: 

$$
\begin{aligned}
C' ={}& C_1(n_{a(1)} + n_{ab(1)}) + C_2(n_{a(12)} + n_{ab(12)}) + C_3(n_{a(2)} + n_{ab(2)}) \\
&+ C_4(n_{b(1)} + n_{ba(1)}) + C_5(n_{b(12)} + n_{ba(12)}) + C_6(n_{b(2)} + n_{ba(2)})
\end{aligned} \tag{7}
$$

where $C_1$ is the cost per unit in sub-domain a(1), ab(1); $C_2$ in a(12), ab(12); $C_3$ in 
a(2), ab(2) of frame A. Similarly $C_4$, $C_5$ and $C_6$ are the cost per unit from frame B. 
In the above cost function random sample sizes are involved. Consider the expected cost 

$$C = E(C') = n_A(C_1 \Phi_1 + C_2 \Phi_2 + C_3 \Phi_3) + n_B(C_4 \Phi_4 + C_5 \Phi_5 + C_6 \Phi_6) \tag{8}$$

where 

$$\Phi_1 = \frac{N_{a(1)} + N_{ab(1)}}{N_A}, \quad \Phi_2 = \frac{N_{a(12)} + N_{ab(12)}}{N_A}, \quad \Phi_3 = \frac{N_{a(2)} + N_{ab(2)}}{N_A},$$

$$\Phi_4 = \frac{N_{b(1)} + N_{ba(1)}}{N_B}, \quad \Phi_5 = \frac{N_{b(12)} + N_{ba(12)}}{N_B}, \quad \Phi_6 = \frac{N_{b(2)} + N_{ba(2)}}{N_B},$$

or 

$$C = n_A C_A + n_B C_B \tag{9}$$

where 

$$C_A = C_1 \Phi_1 + C_2 \Phi_2 + C_3 \Phi_3 \quad \text{and} \quad C_B = C_4 \Phi_4 + C_5 \Phi_5 + C_6 \Phi_6.$$
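In order to get the optimum $p_i$'s as also $n_A$ and $n_B$, the function $F$ is to be 
minimised subject to the expected cost function as given in (9). The weight variables 
$p_i$'s and sample sizes are obtained as follows using Lagrange multiplier: 

$$\frac{p_1}{q_1} = \frac{p_2}{q_2} = \frac{p_3}{q_3} = \frac{p_4}{q_4} = \frac{N_B n_A}{n_B N_A} \tag{10}$$

and 

$$
\begin{aligned}
\frac{n_A^2}{N_A} &= \gamma \, \frac{K_5 + K_1 p_1^2 + K_2 p_2^2 + K_3 p_3^2 + K_4 p_4^2}{C_A}, \\
\frac{n_B^2}{N_B} &= \gamma \, \frac{K_6 + K_1 q_1^2 + K_2 q_2^2 + K_3 q_3^2 + K_4 q_4^2}{C_B}
\end{aligned} \tag{11}
$$

with $\gamma$ determined to meet the expected cost and 

$$
\begin{aligned}
K_1 &= N_{ab(1)} \sigma^2_{ab(1)}, \qquad K_2 = N_{ab(12)} \sigma^{(1)2}_{ab(12)}, \\
K_3 &= N_{ab(2)} \sigma^2_{ab(2)}, \qquad K_4 = N_{ab(12)} \sigma^{(2)2}_{ab(12)}, \\
K_5 &= N_{a(1)} \sigma^2_{a(1)} + N_{a(2)} \sigma^2_{a(2)} + N_{a(12)} (\sigma^{(1)2}_{a(12)} + \sigma^{(2)2}_{a(12)}), \\
K_6 &= N_{b(1)} \sigma^2_{b(1)} + N_{b(2)} \sigma^2_{b(2)} + N_{b(12)} (\sigma^{(1)2}_{b(12)} + \sigma^{(2)2}_{b(12)}).
\end{aligned} \tag{12}
$$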


From (10) and (11), writing $p$ for the common value of the $p_i$'s and $q = 1 - p$, we get 

$$\frac{q^2}{p^2} \, \frac{N_B C_B}{N_A C_A} = \frac{K_6 + (K_1 + K_2 + K_3 + K_4) q^2}{K_5 + (K_1 + K_2 + K_3 + K_4) p^2}. \tag{13}$$

This is a bi-quadratic in $p$ and can be solved for $p$. The optimum sampling fractions can 
be obtained from (11). A practical case commonly met in multiple frame situations is when 
one of the frames has got 100% coverage. Consider 100% coverage by the frame A; then 

$$N_{b(1)} = N_{b(12)} = N_{b(2)} = 0.$$


In this case (13) reduces to 

$$p^2 = \frac{\alpha}{\rho - \alpha} \, \frac{K_5}{K_1 + K_2 + K_3 + K_4} \tag{14}$$

where $\rho = C_A/C_B$ and $\alpha = N_{ab}/N_A$. 
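
When frame A has 100% coverage, the optimum weight takes the closed form $p^2 = \frac{\alpha}{\rho - \alpha} \cdot \frac{K_5}{K_1 + K_2 + K_3 + K_4}$ with $\rho = C_A/C_B$ and $\alpha = N_{ab}/N_A$. A minimal sketch of that calculation follows; the function name and the example values are hypothetical, and the guard on $\rho > \alpha$ is an assumption added so that the formula is well defined.

```python
def optimum_p_squared(alpha, rho, k1, k2, k3, k4, k5):
    """Optimum p^2 under 100% coverage by frame A.

    alpha = N_ab / N_A, rho = C_A / C_B; k1..k4 are the overlap-domain terms
    and k5 the frame-A-only term of the combined variance.
    """
    if rho <= alpha:
        raise ValueError("the closed form requires rho > alpha")
    return alpha / (rho - alpha) * k5 / (k1 + k2 + k3 + k4)


# Hypothetical values: half of frame A overlaps frame B, and a frame A unit
# costs on average 2.5 times as much as a frame B unit.
print(optimum_p_squared(alpha=0.5, rho=2.5, k1=1.0, k2=1.0, k3=1.0, k4=1.0, k5=8.0))
```

A value of $p^2$ near zero puts almost all the weight for the overlap domain on the cheaper frame B sample, which is the expected behaviour when $\rho$ is large.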
Assume that 

$$\sigma^2_{a(1)} = \sigma^{(1)2}_{a(12)}, \quad \sigma^2_{a(2)} = \sigma^{(2)2}_{a(12)}, \quad \sigma^2_{ab(1)} = \sigma^{(1)2}_{ab(12)}, \quad \sigma^2_{ab(2)} = \sigma^{(2)2}_{ab(12)}. \tag{15}$$

These assumptions appear plausible since the variability of one character is not likely to 
be affected by the presence or absence of the other character. Then $p^2$ reduces to 

$$p^2 = \frac{\alpha}{\rho - \alpha} \, \frac{\sigma^2_{a(1)} (N_{a(1)} + N_{a(12)}) + \sigma^2_{a(2)} (N_{a(2)} + N_{a(12)})}{\sigma^2_{ab(1)} (N_{ab(1)} + N_{ab(12)}) + \sigma^2_{ab(2)} (N_{ab(2)} + N_{ab(12)})}$$


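or 

$$p^2 = \frac{(1 - \alpha)}{(\rho - \alpha)} \, \phi'_2 \, \frac{\phi'_3 (\xi_1 + \xi_2) + (1 - \xi_1)}{\phi'_4 (\xi_3 + \xi_4) + (1 - \xi_3)} \tag{16}$$

where 

$$\phi'_1 = \frac{\sigma^2_{a(1)}}{\sigma^2_{ab(1)}}, \quad \phi'_2 = \frac{\sigma^2_{a(2)}}{\sigma^2_{ab(2)}}, \quad \phi'_3 = \frac{\sigma^2_{a(1)}}{\sigma^2_{a(2)}}, \quad \phi'_4 = \frac{\sigma^2_{ab(1)}}{\sigma^2_{ab(2)}}$$

and 

$$\xi_1 = \frac{N_{a(1)}}{N_a}, \quad \xi_2 = \frac{N_{a(12)}}{N_a}, \quad \xi_3 = \frac{N_{ab(1)}}{N_{ab}}, \quad \xi_4 = \frac{N_{ab(12)}}{N_{ab}}.$$

Using that $N_{ab} = N_B$, $N_{a(2)} + N_{a(12)} = N_a - N_{a(1)}$ and 
$N_{ab(2)} + N_{ab(12)} = N_{ab} - N_{ab(1)}$, it may be seen that the above expression of 
$p^2$ reduces to the usual form in the uni-character case, since then $\xi_1 = \xi_3 = 1$ and 
$\xi_2 = \xi_4 = 0$. It may be remarked that the domain variances are generally not known; 
as such, these values are based either on prior knowledge or on some guessed values. The 
optimality of $p^2$ is affected to that extent. 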


3. COMPARISON OF MULTI-CHARACTER SURVEY WITH 
INDEPENDENT UNI-CHARACTER SURVEYS IN 
MULTIPLE FRAME SITUATIONS 


Multi-character surveys are planned with a view to economise the available resources and 
it is expected that a common survey is likely to score over independent uni-character surveys 
taking into account the cost and efficiency. In this situation the extent of gain due to a com- 
mon multiple frame survey is investigated. 

In a single character study for character $y_{(1)}$ (say), consider simple random samples of 
sizes $n_A$ and $n_B$ from the frames A and B respectively. Here we assume that only the 
frames used before are available, not the reduced frame for each character. Define $N_{A1}$, 
$N_{B1}$, $n^*_{A1}$ and $n^*_{B1}$ as the population sizes and sample sizes respectively with 
character $y_{(1)}$. Here, $n^*_{A1}$ and $n^*_{B1}$ are the random sample sizes with 
$E(n^*_{A1}) = n_A N_{A1}/N_A$ and $E(n^*_{B1}) = n_B N_{B1}/N_B$. 
In this case, the estimator $\hat{Y}'^{(1)}$ and its variance are as follows: 

$$
\begin{aligned}
\hat{Y}'^{(1)} ={}& (N_{a(1)} + N_{a(12)}) \, \bar{y}_{(a(1),\,a(12))} \\
&+ (N_{ab(1)} + N_{ab(12)}) \left( p' \bar{y}_{(ab(1),\,ab(12))} + q' \bar{y}_{(ba(1),\,ba(12))} \right) \\
&+ (N_{b(1)} + N_{b(12)}) \, \bar{y}_{(b(1),\,b(12))}
\end{aligned}
$$

where $p'$, $q'$ are weight variables such that $p' + q' = 1$ and $\bar{y}_{(a(1),\,a(12))}$, 
$\bar{y}_{(ab(1),\,ab(12))}$, etc. are sample means for the sample from combined respective 
domains, e.g. $\bar{y}_{(a(1),\,a(12))}$ is the mean of sample units coming from domain a(1) 
and a(12). 


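$$
\begin{aligned}
V(\hat{Y}'^{(1)}) ={}& \frac{N_A}{n_A} \left( N_{a(1)} \sigma^2_{a(1)} + N_{a(12)} \sigma^{(1)2}_{a(12)} \right) \\
&+ \left( \frac{N_A}{n_A} p'^2 + \frac{N_B}{n_B} q'^2 \right) \left( N_{ab(1)} \sigma^2_{ab(1)} + N_{ab(12)} \sigma^{(1)2}_{ab(12)} \right) \\
&+ \frac{N_B}{n_B} \left( N_{b(1)} \sigma^2_{b(1)} + N_{b(12)} \sigma^{(1)2}_{b(12)} \right).
\end{aligned} \tag{17}
$$

In this case, the cost function is of the form 

$$C = C_A n^*_{A1} + C_B n^*_{B1},$$

and expected cost is given by $C^*$ as 

$$C^* = C_A \frac{N_{A1}}{N_A} n_A + C_B \frac{N_{B1}}{N_B} n_B = C'_A n_A + C'_B n_B \tag{18}$$

where $C'_A = C_A N_{A1}/N_A$ and $C'_B = C_B N_{B1}/N_B$. 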


For simplicity, we assume 100% coverage by frame A, equality of variances as in (15), 
and $C_4/C_1 = C_5/C_2 = C_6/C_3 = K$. Based on these assumptions, the cost $C^*$ with 
$n_A$ and $n_B$ which minimize the variance (17) is given by (see Appendix for derivation) 


ow Gt HUGO + oD + ofp} + aNC.q)* 2 


1 | (® + afp’) . ata’ 


l—a ny, Np 
where 


Similarly, for the separate survey for the second character, the cost is obtained as 


(1 = &)[{ GU + aX(® + afp”)}? + af(C.q")?]? 
1 (i + ofp’) ata? 


l=-a Na Np 


Cet = (19) 


where 

5 K®; a 1 - &, 

eS —____, QF = —__ : 
1 + a%(1 — K) l-al- &, 


Pp 


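For the combined character study, the total cost $C$ for 100% coverage by the frame A 
is given by (8). 
Thus 

$$
\begin{aligned}
C ={}& \frac{n_A}{N_A} \left[ C_1 (N_{a(1)} + N_{ab(1)}) + C_2 (N_{a(12)} + N_{ab(12)}) + C_3 (N_{a(2)} + N_{ab(2)}) \right] \\
&+ \frac{n_B}{N_B} \left[ C_4 N_{ab(1)} + C_5 N_{ab(12)} + C_6 N_{ab(2)} \right].
\end{aligned}
$$

Using assumptions in costs (i.e. $C_4/C_1 = C_5/C_2 = C_6/C_3 = K$) we get 

$$
\begin{aligned}
C = C_2 n_A \Big[ & (1 - \alpha) \{ \rho_1 \xi_1 + \xi_2 + \rho_3 (1 - \xi_1 - \xi_2) \} + \alpha \{ \rho_1 \xi_3 + \xi_4 + \rho_3 (1 - \xi_3 - \xi_4) \} \\
&+ \frac{K}{r} \{ \rho_1 \xi_3 + \xi_4 + \rho_3 (1 - \xi_3 - \xi_4) \} \Big]
\end{aligned} \tag{20}
$$

where $r = n_A/n_B$, $\rho_1 = C_1/C_2$ and $\rho_3 = C_3/C_2$. 
But in combined character study $(n_A/n_B)_{\text{opt.}} = p/\alpha q$ where $p$ is given by 
(16). Thus the gain may be obtained from the ratio 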


Cec 

ee 

(& + &JaiTr (1 - &)e3T 
+- 


(@; + afp) (@; + ap) (21) 
ra+ K 
rai - 5 


{org + & + 0,101 = Sits £,)} ai { 018; ye, +10,(1 — £, — 0} 
where 

T, = {(@ + ofp) + af)” + atg’ VK 

T, = {(&; + ofp”) + af}? + afq" VK. 


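$K$ can be determined as follows: Using the definitions of $\Phi_i$'s ($i = 1, \ldots, 6$) 
and equation (A.1), we obtain 

$$\rho = \frac{C_A}{C_B} = \frac{1}{K} \, \frac{\rho_1 \Phi_1 + \Phi_2 + \rho_3 \Phi_3}{\rho_1 \Phi_4 + \Phi_5 + \rho_3 \Phi_6},$$

and thus 

$$K = \frac{1}{\rho} \, \frac{(1 - \alpha) \{ \rho_1 \xi_1 + \xi_2 + \rho_3 (1 - \xi_1 - \xi_2) \} + \alpha \{ \rho_1 \xi_3 + \xi_4 + \rho_3 (1 - \xi_3 - \xi_4) \}}{\rho_1 \xi_3 + \xi_4 + \rho_3 (1 - \xi_3 - \xi_4)}. \tag{22}$$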


The expression in (21) may be used to obtain the gain in cost due to studying both the 
characters simultaneously in comparison to independent individual surveys. The percent gain 
$G$ is thus given by 

$$G = \left( \frac{C^* + C^{**}}{C} - 1 \right) \times 100.$$

In the above cost comparison, the expected costs $C$, $C^*$ and $C^{**}$ do not include the 
overhead costs for the combined or individual surveys. However, it is expected that the sum 
of overhead costs pertaining to individual surveys would be much larger than the 
corresponding overhead cost for the combined survey. Therefore, the actual gain in costs due 
to common multiple frame surveys compared to independent surveys will be larger than the 
percent gain $G$ defined above. 

The expression (21) reduced substantially under the assumptions ®@/ = ®/ = ® (say) and 
§£=&=& = & = & (say). 

From (22) @ = 1/K and from (16) since &//@; = {/®/, the p? reduces as follows: 


A K( —- a) 
Rae 
Also a¥ = a¥ = a/(1 — a). 
Therefore, from (A.1) 
|, (KG - ae)” 
es aa SRS 
Thus 
1 —- Ka 7 a VK 
7; = T, = ® Y 
l-—a l-a 
With all these substitutions in (21), $(C^* + C^{**})/C$ simplifies as follows: 


(6 Onl Ah , 2£6, + (Ul —<£)io3 
c © + atp een ome 
K( —- a) 
im Ti ’ Gant <(20; 103) 
(+ eae FE | preec(on 2 os) 


SR as E(2e@, — Q3) 
e; + (1 + oe, — 203) 


where r = (n,/ng) opt. = p/aq from (10). 
Hence, 


+ — | 
Ge E(Q, Q3 ) x 100. 
Osman (L “he On +203) 


The equality of $\xi$'s does not seem to be a realistic assumption. The value of $G$ has 
therefore been calculated using (19) for realistic and representative combinations of 
parameters and is presented in Table 1. 

This table indicates that there is a definite gain due to integration of multiple frame surveys 
for both the characters in comparison to separate individual surveys. The gain increases with 
increasing values of $\rho_1$ and $\rho_3$. 


4. COMPARISON OF TWO CHARACTER MULTIPLE FRAME 
SURVEYS WITH SINGLE FRAME SURVEY 


Comparison of two frame survey with single frame surveys for study of two characters 
is of practical interest. For single character a similar study was carried out by Hartley (1962). 
On similar lines the relative reduction in cost was obtained as 


ui | 
ipeceeeewes 2S 
De 


where $p^2$ is given by (16), $\rho = C_A/C_B$ and $\alpha = N_{ab}/N_A$. 
The reduction in cost due to multiple frame over a single frame survey is tabulated in 
Table 2 for some set of parametric values. The table indicates considerable cost reduction. 


Table 1 


Percent Gain in Cost for Common multiple Frame Survey 
for Both Characters over Individual Surveys, 


When 9 = 10; ${ = 0.256) = 0.5, &! = 1, a = 0.5. 


g = OR: & = 0.2, &, ma 0.4, &, — 0.2 


——_———— 


0.3 1:5 ano 
0.4 i, 4.2 6.4 
0.5 1.8 4.4 6.7 Sar 
0.6 1.8 4.5 6.9 8.9 10.7 
0.7 1.7 4.6 7.0 9.1 10.9 12.6 
0.8 oti 4.6 7.1 9:3 D2 12.8 14.3 
0.9 4.5 7.1 9.4 [1.3 13.0 14.5 15.9 


Ev 1012, 5 = 0.4) 20.2, Fp 04 


0.3 4.5 ea! 
0.4 4.6 9°33 13.6 
0.5 4.8 9.6 14.0 17.9 
0.6 4.9 see) 14.3 18.3 22.0 
0.7 =| 10.1 14.7 18.8 2205 Pa) 
0.8 Sy 10.4 151 1973 2521 26.5 29.7 
0.9 10.8 1555 19.8 23:0 21.1 30.3 3352 


Table 2 
Reduction in Cost for Constant Variances 
When ; = 0.25, 6; = 0.5, 6; = 1, and & = 0.2, & = 0.3, & = 0.4. 


a 

) 0.5 0.6 0.7 0.8 0.9 0.95 
100 227 175 132 094 059 040 
20 304 254 .200 .169 sh 27, 101 
10 367 Ae PAL 279 .238 193 .164 
a 462 423 387 351 308 DA 

2 661 646 634 621 599 578 




APPENDIX 


Minimizing the variance (17) with respect to C* with the assumption of 100% coverage 
by frame A and the equality of variances, the optimum solution for p’ is obtained as 


then 1—a | oh + &) 
Oo. aa O2nay(&3 = &) 


with 


Using Nay = Nagy + Naa + Naoay + Noavaz and Na = 
ten as 


0 sree 


Cy Nw (&3 = £4) 


ORY BT Se cas 
ed aa idl 
Cy Gero tae; 


a|{l 
eee | 
K |\ af 
where 
: a ERCHIE, 
aft = 5 
ase eee 
Then we have 
l-—a & + 
ps 1 Se 
tk ceehiicg 
=|—+ 1|-a 
K | at 


Ke! 
[eon 


C, NE + &) + NoalSs + &) 


ab(1) a Nayar) Q’ can be writ- 


(A.1) 
Define 


nl 2 2 
A, = (Naw + Nay) Fay + P (Noa) + Nopa2)) Fab)» 


ares! 2 
Ay = Q*(Niway + Naw12)) F201) » 


> 
Ww 
| 


- 2 2 2 
(Nici) aie Naz) Fa) te De (N20) ih Ni012)) Fao)» 


re 
| 


= q'? (Nav) + Nawi2)) F201) « 
With the p’ in (A.1), the optimum sample sizes will be 


2 2 2 2 
Nao _ . (Nagy + Naa2)) aoa) + p' (Navay + Nawar)) 520012) 


N, C4 

/ A; 
No Zhss Q'*(Navay + Navy) aoe) ae 4 
Nz Cy Cy 


with $\gamma'$ determined with respect to (18). From this we get 


1 
Tigo. \) Np [Ci Nair pe (A.2) 
N40 Na \CaNp3 


Also, the variances given by (5) and (17) at optimum sample sizes can be written as 


m N. N 
ViGPo ys Cotas coe 
Ny Np (A.3) 
a N N 
Vcyve) => sng + ae 
Nao Ngo 


Equating the above variances and using (A.2), we obtain expression for 149 and Ngo in 
terms of n, and nz, as follows: 
1 
Gu Nery 
Naso CiNai 
Na Nay play 
nm, ! Np” 
and 
1 
CNet 
Ngo CiNai 
NS 4 
Na 1 Nz 2 


Using these relationships, the cost C* may be obtained as 


&, + E){C.a + a(S! + atp')}% + at(Cq’?)”]? 


C* = (A.4) 
1 (®/ + a*p?) aatq? 
+ 
1 er CX, nN, Np 
Seasonal Adjustment of Labour Force Series during Recession and 
Non-Recession Periods 


ESTELA BEE DAGUM and MARIETTA MORRY¹ 


ABSTRACT 


This paper analyzes the revisions of eight seasonally adjusted labour force series during recession and 
non-recession periods. The four seasonal adjustment methods applied are X-11 and X-11-ARIMA us- 
ing either concurrent or forecast seasonal factors. The series are seasonally adjusted with these four 
methodologies according to both a multiplicative and an additive decomposition model. The results 
indicate that the X-11-ARIMA concurrent adjustment yields the smallest revisions both during reces- 
sion and non-recession periods regardless of the decomposition model used. 


KEY WORDS: Survey; X-11; X-11-ARIMA; Concurrent adjustment; Recession/non-recession. 


1. INTRODUCTION 


Seasonality in some of the labour force series may be subject to abrupt changes due to 
dramatic variations in their composition during the various stages of the business cycle. An 
important example is total unemployment. In relatively prosperous years, it consists mainly 
of persons shifting jobs, new entrants to the labour market, workers from the primary sec- 
tor (agriculture, forestry, fishing, trapping, etc.) and construction (in the winter), and students 
seeking jobs (in the summer). On the other hand, during recessions, the number of unemployed 
increases quickly and the newly unemployed are mainly regular workers from heavy industries 
and related activities characterized by seasonal variations of smaller amplitudes and seasonal 
patterns different from those in ‘normal’ years. This kind of shift was observed in Canada 
in 1981-1982, where the total unadjusted unemployment rose from 790,000 in August 1981 
to 1,494,000 in December 1982; the newly unemployed coming mainly from the manufac- 
turing and service industries. 

The rapid changes in the size and composition of total unemployment during the depressed 
phase of the business cycle raise the question as to whether the procedure followed to 
estimate seasonal factors based on data for years of low, mainly frictional and ‘outdoor’ 
unemployment, is applicable to data for years of high unemployment with a large number 
of the jobless added from the secondary and tertiary sectors. 

Empirical research at Statistics Canada in 1974 led to current seasonal adjustment of labour 
force series by the X-11-ARIMA method using concurrent seasonal factors. This method 
of adjustment will be referred to as the ‘official’ procedure in the sections to follow. The 
U.S. Bureau of Labor Statistics officially adopted the X-11-ARIMA method in 1980 using 
six-month-ahead projected seasonal factors. This agency also releases monthly the unemploy- 
ment rate calculated with X-11-ARIMA and concurrent seasonal factors. Concurrent seasonal 
factors are obtained by seasonally adjusting, each month, all the data available up to and 
including that month whereas projected seasonal factors are generated from data that ended 
usually one year before (in the case of the Bureau of Labor Statistics, six-months before). 

In Section 2, the mean absolute error (MAE) of concurrent and year-ahead projected 
seasonal factors is given for eight Canadian labour force series obtained from X-11-ARIMA 
and X-11 using the multiplicative seasonal adjustment option. Year-ahead instead of 
six-months-ahead projected factors are analyzed because they are applied by several 
government statistical agencies. Furthermore, the MAE’s of six-months-ahead factors fall 
between those of concurrent and year-ahead projected factors. 

The main purpose of this study is to assess whether the use of X-11-ARIMA with concur- 
rent seasonal factors still produces the smallest revisions during recession years when com- 
pared to three feasible alternative procedures. 

In Section 3, the mean absolute revisions of the additive current seasonal adjustment are 
calculated for the four alternative procedures and the MAE’s of the additive are compared to 
the multiplicative options. 

Finally, the conclusions of this study are presented in Section 4. 


2. REVISIONS OF CURRENT SEASONALLY ADJUSTED LABOUR FORCE 
SERIES DURING RECESSION AND NON-RECESSION PERIODS 


The majority of the seasonal adjustment methods applied by government statistical agen- 
cies are based on linear smoothing filters, usually known as moving averages. It is inherent 
to these methods that the estimates from the observations of the most recent years are less 
accurate than those corresponding to central data because of the asymmetry of the end point 
filters. Among these methods, the Method II-X-11 variant developed by Shiskin, Young, 
and Musgrave (1967) and X-11-ARIMA developed by Dagum (1980) are the most widely 
applied. The X-11-ARIMA method is a modified version of the X-11 variant that basically 
consists of two steps. First, the original series are extended with extrapolation values from 
ARIMA (autoregressive integrated moving averages) models of the type developed by Box 
and Jenkins (1970), and then the extended series are seasonally adjusted with a set of moving 
averages that result from the combination of the X-11 seasonal filters with the extrapolation 
ARIMA filters. Therefore, the seasonal adjustment filters of X-11-ARIMA and X-11 differ 
for the data of the most recent year. For both procedures the same symmetric filter is ap- 
plied to central observations. If the ARIMA option is not used, then the X-11-ARIMA reduces 
to the X-11 method. 

As more data become available, the seasonally adjusted estimate pertaining to a time point 
keeps getting revised until the data point in question is three years away from the end of 
the series and the symmetric filters apply, at which point the estimate becomes virtually fix- 
ed and is referred to as the final seasonally adjusted estimate. The difference between the 
very first and the final seasonally adjusted estimate is called the total revision. The revisions 
of current seasonally adjusted values by the X-11-ARIMA and X-11 methods are due to: 
(1) Differences in the smoothing linear filters applied to the same observations as more data 
become available; and, (2) the innovations that enter into the series with new observations. 
One would like to see the revisions of the first kind reduced to a minimum or completely 
eliminated. 

Theoretical studies by one of the authors (Dagum 1982a and 1982b) have shown that the 
revisions of current seasonally adjusted values due to filter changes can be reduced substan- 
tially if: (1) the original series is extended with ARIMA extrapolated values i.e., the 
X-11-ARIMA is applied; and (2) concurrent seasonal factors are used instead of year-ahead 
seasonal factors. The conclusion drawn from these two theoretical studies conforms to the 
results given in several empirical and theoretical works (see e.g. Dagum 1978, Dagum and 
Morry 1982, Kuiper 1978, 1981; Pierce 1980; Kenny and Durbin 1982; McKenzie 1982; Wallis 
1982; Pierce and McKenzie 1985; Otto 1985). 

Next, we examine the performance of X-11-ARIMA with concurrent seasonal factors com- 
pared to three other feasible alternatives for recession and non-recession periods. The better 
seasonal adjustment procedure will be the one that yields smaller revisions. 
2.1 Comparisons of Four Alternative Procedures for Current Seasonal Adjustment of 
Labour Force Series 


There are four seasonal adjustment procedures commonly applied to obtain current 
seasonally adjusted values, namely: 


(1) X-11-ARIMA with concurrent seasonal factors; 

(2) X-11 with concurrent seasonal factors; 

(3) X-11-ARIMA with year-ahead projected seasonal factors; and 
(4) X-11 with year-ahead projected seasonal factors. 


The revision measure used here for the evaluation of the four alternative procedures is 
the mean absolute error (MAE) of the seasonal factors for current seasonal adjustment defined 
by: 


$$\mathrm{MAE}(N) = \sum_{t=1}^{N} |S_t^{c} - S_t^{f}| / N \tag{1}$$

In this expression, $N$ is the number of data points included in the mean; $S_t^{c}$ denotes 
the current seasonal factor value, which can be either a concurrent or a year-ahead projected 
seasonal factor from X-11 or X-11-ARIMA. $S_t^{f}$ denotes the ‘final’ seasonal factor in the 
sense that it will not change significantly when the series is augmented with new data. For 
X-11 and 
X-11-ARIMA, a current seasonal factor becomes final when at least three years of data are 
added to the series (Young 1968; Wallis 1974). This study analyzes the revisions in the seasonal 
factors (or implicit seasonal factors in the additive case) rather than in the seasonally ad- 
justed estimates for several reasons. First, using seasonal factors provides a feel for the size 
of revisions relative to the level of the series (it is in the form of a percentage); second, it 
standardizes the revision size within series subject to substantial jumps in level (such as the 
unemployment series); third, it allows for cross-series comparisons. 
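
As an illustration of equation (1), the following minimal sketch computes the MAE between a set of current seasonal factors and the corresponding final factors. The function name and the numerical values are hypothetical and are not taken from the series analyzed here; the factors are assumed to be expressed in percent.

```python
def mae_seasonal_factors(current, final):
    """Mean absolute error between current and final seasonal factors.

    Both arguments are sequences of equal length N (one value per month).
    """
    if len(current) != len(final) or len(current) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(c - f) for c, f in zip(current, final))
    return total / len(current)


# Hypothetical seasonal factors in percent (illustrative only).
current_factors = [101.2, 98.7, 100.4, 99.1]
final_factors = [100.9, 99.3, 100.4, 98.6]

print(round(mae_seasonal_factors(current_factors, final_factors), 3))
```

Because the factors are in percent, the resulting MAE can be read directly as an average revision in percentage points of the level of the series.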

This study analyzes the revisions in the seasonal factors (or implicit seasonal factors in 
the additive case) rather than in the seasonally adjusted estimates for several reasons. First, 
using seasonal factors provides a feel for the size of revisions relative to the level of the series 
(it is in the form of a percentage); second, it standardizes the revision size within series sub- 
ject to substantial jumps in level (such as the unemployment series); third, it allows for cross- 
series comparisons. 

Unlike in a previous paper by the authors (Dagum and Morry 1982), the revisions in the 
month-to-month movement of the seasonally adjusted data were not included in the analysis 
since these revisions are not of primary interest when dealing with labour force data (for 
example, Statistics Canada does not publish yearly revisions of the growth-rate for these 
series). Consequently, this paper focuses on the revisions in the level rather than on revisions 
in the change in level. 

The eight Canadian series of employment and unemployment analyzed here start in January 
1966 and end in October 1982. To use the ARIMA extrapolation option of X-11-ARIMA 
a period of at least five years is necessary to produce a seasonally adjusted series. Conse- 
quently, the first year for which total revision measures can be calculated is 1971. Taking 
into account the need for at least three and a half more years for a current estimate to become 
final, the last full year for which MAE can be obtained is 1977. Within this seven-year span 
of revisions, we distinguished two years of recession and five years of non-recession. The 
recession period includes data from August 1974 until July 1975 and June 1976 until May 
1977. These two years were considered recessionary because they showed high increases (greater 
than 25%) in the annual levels of total unemployment due mainly to large inflows of job losers. 
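
The month counts that appear as N in the tables below follow directly from these dates. The short sketch below, whose helper name is hypothetical, simply counts calendar months under the assumption that both end points of each period are included.

```python
from datetime import date


def months_between(start, end):
    """Number of calendar months from start to end, both months included."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


# Full span for which revision measures are available: 1971 to 1977.
span = months_between(date(1971, 1, 1), date(1977, 12, 1))

# The two recession years identified in the text.
recession = (months_between(date(1974, 8, 1), date(1975, 7, 1))
             + months_between(date(1976, 6, 1), date(1977, 5, 1)))

non_recession = span - recession
print(span, recession, non_recession)  # 84 24 60
```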




Another important aspect taken into consideration is the kind of decomposition model 
used for the seasonal adjustment of each series. The X-11 and the X-11-ARIMA methods 
provide both additive and multiplicative decomposition models. There are no theoretical 
reasons for one model to be preferable to the other. They are based on different assump- 
tions concerning the generating mechanism of the seasonal component. 

In an additive model, the components of a time series (trend-cycle, seasonal variations 
and irregular fluctuations) are assumed to be independent and, therefore, the seasonal effect 
is not affected by the level of the economic activity conditioned by the stages of the business 
cycle. 

On the other hand, in a multiplicative model, the seasonal effect is proportional to the 
trend-cycle. If the seasonal factors are constant, it means the higher the level of the seasonally 
adjusted series, the higher the seasonal effect. 
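
The difference between the two models can be seen in a small numerical sketch. The trend-cycle levels, the additive seasonal effect and the multiplicative seasonal factor used below are hypothetical; the irregular component is set to its neutral value in both cases.

```python
def additive_value(trend_cycle, seasonal_effect, irregular=0.0):
    """Additive model: observation = trend-cycle + seasonal + irregular."""
    return trend_cycle + seasonal_effect + irregular


def multiplicative_value(trend_cycle, seasonal_factor, irregular=1.0):
    """Multiplicative model: observation = trend-cycle x seasonal x irregular."""
    return trend_cycle * seasonal_factor * irregular


# Hypothetical trend-cycle levels, e.g. before and after a jump in level.
for level in (100.0, 200.0):
    add_effect = additive_value(level, 10.0) - level
    mult_effect = multiplicative_value(level, 1.10) - level
    # The additive effect stays at 10 units; the multiplicative effect
    # doubles when the level doubles.
    print(level, round(add_effect, 2), round(mult_effect, 2))
```

With a constant seasonal factor of 110%, the seasonal effect measured in units of the series grows from 10 to 20 when the level doubles, whereas the additive effect is unchanged.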

The selection of the decomposition model is not crucial for the estimation of ‘final’ 
seasonally adjusted values since for most cases the corresponding figures are similar. The 
problem of model selection, however, becomes very important when approached from the 
viewpoint of the estimation of the seasonal component of the end years of a series, particularly, 
of series with a rapidly growing trend-cycle. The asymmetric filters used for the end points 
estimation, particularly those of the X-11 method, introduce large systematic errors if the 
seasonal estimates change fast (Dagum 1978). In fact, if the underlying decomposition model 
is that of a rather stable multiplicative seasonality, an additive seasonal adjustment will
produce seasonal estimates that appear to vary with the trend-cycle. Reciprocally, if stable
additive seasonality is the norm, a multiplicative adjustment will produce seasonal factors that 
look unstable or fast moving. 

From the viewpoint of seasonal adjustment, it is then preferable to choose the decom- 
position model that yields the most stable seasonal estimates. The tests developed by Morry 
(1975) and Higginson (1977) have been applied to the eight series to determine the preferred 
decomposition models. 

The results of these tests indicated that only two series, unemployment of adult and young 
women, follow an additive model; the remaining series are of the multiplicative type. 

In this study, however, the mean absolute revisions have been analyzed under both
assumptions, that is, the components of each series are either multiplicatively or additively
related. We are using additive and multiplicative decomposition models for data spanning both
recessionary and non-recessionary periods in order to determine which of these two
decomposition models is more sensitive to sudden changes of level from the viewpoint of revision. 

The calculations shown in the following tables were obtained from multiplicative seasonal 
adjustment. The results from additive adjustment are discussed in Section 3. 

Table 1 shows the mean absolute error (MAE) of the seasonal factors of X-11-ARIMA 
and X-11 applied for current seasonal adjustment during recession years. It is apparent that 
X-11-ARIMA with concurrent seasonal factors yields the smallest revisions. This result is 
consistent with the theoretical findings discussed above which determined that the use of 
the ARIMA extrapolation option with concurrent seasonal factors significantly reduces filter 
revisions. 

For six out of the eight series analyzed, X-11 with concurrent seasonal factors ranks se- 
cond. For the other two series (unemployed and employed adult men) X-11/concurrent shows 
the same MAE results as does X-11-ARIMA with year-ahead projected seasonal factors. Final- 
ly, the least accurate estimates are obtained from X-11 with year-ahead projected seasonal 
factors. 

Table 2 shows the relative size of the revisions from each alternative procedure with respect 
to X-11-ARIMA with concurrent seasonal factors during recession years. All the values are 
greater than 1.0 indicating that none of the alternative options gives revisions smaller than 
X-11-ARIMA/concurrent. 
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Table 1 


Mean Absolute Errors (MAE(N)) of Seasonal Factors of X-11-ARIMA 
and X-11 during Recession Years^a (N = 24) 


                       Concurrent Seasonal        Year-ahead Projected
                             Factors                Seasonal Factors
Series                 X-11-ARIMA     X-11        X-11-ARIMA     X-11
                          (1)         (2)            (3)         (4)

Unemployment
  Men 25+                 1.95        2.75           2.74        3.35
  Women 25+               1.94        2.94           3.43        4.70
  Men 15-24               2.16        3.02           3.49        4.33
  Women 15-24             1.25        1.73           2.48        3.44
Employment
  Men 25+                 0.08        0.12           0.12        0.16
  Women 25+               0.23        0.29           0.33        0.42
  Men 15-24               0.41        0.53           0.66        0.76
  Women 15-24             0.50        0.70           0.81        0.97


a August 1974 - July 1975 and June 1976 - May 1977. 


Table 2 


Comparison of MAE(N)’s from Three Alternative Procedures Versus 
X-11-ARIMA/Concurrent for Multiplicative Seasonal Adjustment of Employment 
and Unemployment Series in Recession Years (N = 24) 


                     X-11 Concurrent   X-11-ARIMA Projected   X-11 Projected
                           vs.             Factors vs.          Factors vs.
Series                 X-11-ARIMA          X-11-ARIMA           X-11-ARIMA
                       Concurrent          Concurrent           Concurrent
                         (1)^a               (2)^b                (3)^c

Unemployment
  Men 25+                 1.41                1.40                 1.72
  Women 25+               1.52                1.77                 2.41
  Men 15-24               1.40                1.61                 2.00
  Women 15-24             1.38                1.98                 2.75
Employment
  Men 25+                 1.50                1.50                 1.50
  Women 25+               1.26                1.43                 1.83
  Men 15-24               1.29                1.61                 1.85
  Women 15-24             1.40                1.62                 1.94


a (1) equals column (2) ÷ column (1) of Table 1. 
b (2) equals column (3) ÷ column (1) of Table 1. 
c (3) equals column (4) ÷ column (1) of Table 1. 
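
The ratios in Table 2 can be reproduced from the corresponding row of Table 1, as the footnotes indicate. The sketch below does this for one row, Employment, Women 15-24; the function name is hypothetical, and since only the rounded MAE values of Table 1 are used, the second decimal may differ slightly from the published ratio for some other rows.

```python
def relative_revisions(mae_row):
    """Divide columns (2), (3) and (4) of a Table 1 row by column (1)."""
    base = mae_row[0]
    return [round(value / base, 2) for value in mae_row[1:]]


# Table 1, Employment, Women 15-24: columns (1) to (4).
women_15_24_employment = [0.50, 0.70, 0.81, 0.97]

print(relative_revisions(women_15_24_employment))  # [1.4, 1.62, 1.94]
```
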
The non-recession period runs from January 1971 to December 1977, excluding the recession
years. Table 3 shows the MAE of the seasonal factors for current seasonal adjustment for the four 
procedures during these years. Similarly to Table 1, X-11-ARIMA with concurrent seasonal 
factors yields the smallest revisions for all the series due to minimal filter revisions as pointed 
out before. For seven out of the eight series X-11/concurrent ranks second with values relative- 
ly close to those shown for X-11-ARIMA with year-ahead projected factors. Finally, the 
most unreliable procedure in terms of the magnitude of the revision is X-11 with year-ahead 
seasonal factors. 

The relative sizes of the revisions of the three alternative procedures with respect to the 
X-11-ARIMA/concurrent procedure during non-recession years are shown in Table 4. The 
figures in column (1) with the exception of one entry, however, are smaller than those shown 
in column (1) of Table 2 which would indicate that during recession years the percentage 
gains achieved by using ARIMA extrapolation are even higher than during non-recession years. 

Finally, Table 5 compares the size of the revisions during recession versus non-recession 
years for the two best procedures. The results show that X-11-ARIMA/concurrent, which 
is Statistics Canada’s official procedure, gives smaller MAE values compared to those of the 
second best alternative, X-11/concurrent. Most of the ratios in the first column are very close 
to 1.0, indicating that the revisions in times of recession are similar in size to those in
non-recession years when using the ARIMA extrapolation option. If X-11 with concurrent seasonal 
factors is applied, the size of revision is substantially higher in most series during recession 
than in ‘normal’ times. This is due to the fact that the rapid change in the level of the series, 
introduced by the new observations of the recession years, is not estimated as well by the 
end filters. In fact, gradual movements and some of the level increase are passed to the seasonal 
component. 


Table 3 


Mean Absolute Errors (MAE(N)) of Seasonal Factors of X-11-ARIMA 
and X-11 during Non-Recession Years^a (N = 60) 


                       Concurrent Seasonal        Year-ahead Projected
                             Factors                Seasonal Factors
Series                 X-11-ARIMA     X-11        X-11-ARIMA     X-11
                          (1)         (2)            (3)         (4)

Unemployment
  Men 25+                 1.37        1.73           2.22     [illegible]
  Women 25+               1.84        2.41           2.92        3.55
  Men 15-24               1.97        2.66           3.17        3.96
  Women 15-24             1.93        2.87           2.59        3.18
Employment
  Men 25+                 0.08        0.10           0.12        0.13
  Women 25+               0.23        0.27           0.33        0.34
  Men 15-24               0.39        0.46           0.58        0.69
  Women 15-24             0.43        0.49           0.68        0.80


a From January 1971 until December 1977 excluding recession periods defined in Table 1 footnote (a) 
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Table 4 


Comparison of MAE(N)’s from Three Alternative Procedures Versus 
X-11-ARIMA/Concurrent for Multiplicative Seasonal Adjustment of Employment 
and Unemployment Series in Non-Recession Years (N = 60) 


                     X-11 Concurrent   X-11-ARIMA Projected   X-11 Projected
                           vs.             Factors vs.          Factors vs.
Series                 X-11-ARIMA          X-11-ARIMA           X-11-ARIMA
                       Concurrent          Concurrent           Concurrent
                         (1)^a               (2)^b                (3)^c

Unemployment
  Men 25+                 1.26                1.62                 1.99
  Women 25+               1.31                1.59                 1.93
  Men 15-24               1.35                1.61                 2.01
  Women 15-24             1.49                1.34                 1.65
Employment
  Men 25+                 1.25                1.50                 1.62
  Women 25+               1.17                1.43                 1.48
  Men 15-24               1.18                1.49                 1.77
  Women 15-24             1.14                1.58                 1.86


a (1) equals column (2) ÷ column (1) of Table 3. 
b (2) equals column (3) ÷ column (1) of Table 3. 
c (3) equals column (4) ÷ column (1) of Table 3. 


Table 5 


Comparison of MAE(N)’s of Concurrent Seasonal Factors of X-11-ARIMA and 
X-11 for Recession Versus Non-Recession Years Using the Multiplicative Option 


                     X-11-ARIMA Concurrent       X-11 Concurrent
                       Recession Years           Recession Years
                          (N = 24)                  (N = 24)
Series                      vs.                       vs.
                     Non-Recession Years       Non-Recession Years
                          (N = 60)                  (N = 60)
                           (1)^a                     (2)^b

Unemployment
  Men 25+                  1.42                      1.59
  Women 25+                1.05                      1.22
  Men 15-24                1.09                      1.35
  Women 15-24              0.67                      0.60
Employment
  Men 25+                  1.00                      1.20
  Women 25+                1.00                      1.07
  Men 15-24                1.05                      1.27
  Women 15-24              1.16                   [illegible]


a (1) equals column (1) of Table 1 ÷ column (1) of Table 3. 
b (2) equals column (2) of Table 1 ÷ column (2) of Table 3. 
The only exception is the series unemployed women 15 to 24 where revisions with both 
methods are smaller during economic hardship. This can be explained by the special behaviour 
of this series during the period analyzed, which is characterized by large annual increases 
of about 15% for 1966-73 and 8.5% for 1973-80 and an additive seasonal component, in- 
dependent of the business-cycle (i.e., the change in level reflected more the changing behaviour 
of young women than the effect of the business-cycle). 

Another special case is the series unemployed men 25 years and over. Here recession years 
were characterized by much larger revisions than non-recession periods even with ARIMA 
extrapolations as indicated by a ratio of 1.42. This large discrepancy between the two periods 
is a result of the drastic composition changes in seasonality that this series undergoes during 
times of recession as discussed before. Without ARIMA extrapolation, the revision sizes 
deviate even more (the ratio is 1.59), since apart from the changes in composition the unreliable 
seasonal estimates produced during recession introduce added discrepancies. 


3. COMPARISON OF ADDITIVE VERSUS MULTIPLICATIVE 
CURRENT SEASONAL ADJUSTMENT DURING 
RECESSION AND NON-RECESSION PERIODS 


It is often argued that during recession periods the use of an additive instead of a 
multiplicative decomposition model is to be preferred from the viewpoint of the
minimization of revisions. The main reasons given for this are: (1) in an additive model, the time 
series components are assumed to be independent and, therefore, the seasonal effect is not 
affected by the level of the trend-cycle contrary to what occurs with a multiplicative model; 
and (2) the inflexibility of the end-point filters to estimate adequately fast-moving seasonality. 

The eight labour force series analyzed in the previous section were additively seasonally 
adjusted in order to assess this new alternative. The results obtained confirm the ranking 
given by the multiplicative option. Namely, X-11-ARIMA/concurrent yields the smallest 
revisions followed by X-11/concurrent and X-11-ARIMA/year-ahead projected, in that order. 
The least accurate estimates are obtained with X-11/year-ahead projected. It is important 
to note that factors of additive seasonal adjustment mean implicit factors in the sense that 
they result from the quotient between the original series and the seasonally adjusted series. 
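
A minimal sketch of how such implicit factors could be obtained is given below. The function name and the data are hypothetical, and expressing the quotient in percent is an assumption made here so that the result is comparable with multiplicative seasonal factors.

```python
def implicit_seasonal_factors(original, seasonally_adjusted):
    """Implicit factors: original series divided by the adjusted series.

    The quotient is multiplied by 100 (an assumption of this sketch) so that
    a value of 100 means no seasonal effect.
    """
    if len(original) != len(seasonally_adjusted):
        raise ValueError("series must have the same length")
    return [100.0 * o / a for o, a in zip(original, seasonally_adjusted)]


# Hypothetical additively adjusted data (illustrative only).
original_series = [520.0, 480.0]
adjusted_series = [500.0, 500.0]

print(implicit_seasonal_factors(original_series, adjusted_series))  # [104.0, 96.0]
```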

Tables 6 and 7 show the relative size of the revisions by each alternative procedure with 
respect to X-11-ARIMA/concurrent, for the recession and non-recession periods, respec- 
tively. All the values are greater than one indicating that none of the alternative procedures 
gives smaller revisions than X-11-ARIMA/concurrent. Since the latter ranks first for both 
additive and multiplicative seasonal adjustment options, we compare for each series which 
of the two decomposition models gives the smallest revisions. 

In Table 8 the data show that for the two series that affect the unemployment rate the 
most, i.e., the unemployment and employment of adult men, the multiplicative option is 
to be preferred during recession as well as non-recession years. For the most part, these data 
confirm the decomposition models chosen by Statistics Canada according to the model tests 
(Morry 1975; Higginson 1977). The only apparent exception is the series Employed Men 15-24 
which would do better with an additive model. However, given the fact that the size of the 
revisions is already very small, this improvement is of no consequence. The MAE’s from 
the multiplicative adjustment are 0.41 (recession period) and 0.39 (non-recession period) and 
are reduced by the additive options to 0.33 and 0.31 respectively. 

Finally, we observe that the unemployment of adult women would have smaller revisions 
with a multiplicative instead of an additive seasonal adjustment during recession years. 
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Table 6 


Comparison of MAE(N)’s from Three Alternative Procedures Versus 
X-11-ARIMA/Concurrent for Additive Seasonal Adjustment of Employment 
and Unemployment Series in Recession Years (N = 24) 


                     X-11 Concurrent   X-11-ARIMA Projected   X-11 Projected
                           vs.          Implicit Factors     Implicit Factors
Series                 X-11-ARIMA             vs.                  vs.
                       Concurrent          X-11-ARIMA           X-11-ARIMA
                                           Concurrent           Concurrent

Unemployment
  Men 25+                 1.18                1.29                 1.38
  Women 25+               1.16                1.49                 1.75
  Men 15-24            [illegible]            1.48                 1.70
  Women 15-24             1.33                1.74                 1.84
Employment
  Men 25+                 1.44                1.69                 2.08
  Women 25+               1.26                1.53                 1.65
  Men 15-24               1.02                1.05                 1.34
  Women 15-24             1.50                1.50                 2.05

Table 7 


Comparison of MAE(N)’s from Three Alternative Procedures Versus 
X-11-ARIMA/Concurrent for Additive Seasonal Adjustment of Employment 
and Unemployment Series in Non-Recession Years (N = 60) 


                     X-11 Concurrent   X-11-ARIMA Projected   X-11 Projected
                           vs.          Implicit Factors     Implicit Factors
Series                 X-11-ARIMA             vs.                  vs.
                       Concurrent          X-11-ARIMA           X-11-ARIMA
                                           Concurrent           Concurrent

Unemployment
  Men 25+                 1.34                1.65                 1.88
  Women 25+               1.20                1.59              [illegible]
  Men 15-24               1.22                1.57                 1.89
  Women 15-24             1.05                1.20                 1.26
Employment
  Men 25+                 1.16                1.24                 1.54
  Women 25+               1.10                1.27                 1.30
  Men 15-24               1.22                1.32              [illegible]
  Women 15-24             1.41                1.68                 2.16
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Table 8 


Comparison of MAE(N)’s of Seasonal Factors from Additive Versus Multiplicative X-11-ARIMA 
(Concurrent) Seasonal Adjustment during Recession and Non-Recession Periods 


                         Recession Period          Non-recession Period
                             (N = 24)                    (N = 60)
                       Additive X-11-ARIMA         Additive X-11-ARIMA
Series                     Concurrent                  Concurrent
                              vs.                         vs.
                    Multiplicative X-11-ARIMA   Multiplicative X-11-ARIMA
                           Concurrent                  Concurrent

Unemployment
  Men 25+                  [illegible]                 [illegible]
  Women 25+                   1.14                        0.88
  Men 15-24                   1.23                        1.05
  Women 15-24                 0.93                        0.85
Employment
  Men 25+                     1.25                     [illegible]
  Women 25+                   1.00                        1.00
  Men 15-24                   0.80                        0.80
  Women 15-24                 1.14                     [illegible]


4. CONCLUSIONS 


The results of Sections 2 and 3 can be summarized as follows: 


(1) The X-11-ARIMA method with concurrent seasonal factors gives the smallest revi- 
sions for each series, whether an additive or a multiplicative seasonal adjustment is 
made, during both recession and non-recession years. 


(2) The comparisons of the magnitude of the revision from additive versus multiplicative 
seasonal adjustment with X-11-ARIMA/concurrent indicate clearly that the two series 
that affect the unemployment rate most, unemployment and employment of adult men, 
are of the multiplicative type during times of recession as well as non-recession. 


(3) During recession years, the use of X-11-ARIMA with year-ahead factors and of 
X-11/concurrent yields equal MAE’s for employment and unemployment of adult men. 
For the six remaining series, however, X-11/concurrent is the second best alternative. 


(4) The least accurate current seasonal adjustment estimates for all series in all the situa- 
tions discussed are obtained with X-11 with year-ahead projected seasonal factors. 


(5) The comparisons of the revisions during recession versus non-recession periods from 
X-11-ARIMA/concurrent show that they are of relatively similar magnitude with the 
important exception being Unemployed Men 25 years and over, where revisions are 
much higher in recession years. This concurs with the fact that this series undergoes 
abrupt seasonal changes because of drastic variations in its composition. The larger 
revisions are mainly due to these new innovations. 
On the other hand, the use of concurrent seasonal factors with X-11 shows, for most 
series, large discrepancies in the size of the revisions of these two periods. This is an 
indication that revisions result mainly from the inadequacy of the end filters to estimate 
well the rapidly changing levels of recession periods. 

For only one series, Unemployed Women 15-24 years, the two best procedures yield 
revisions substantially larger in non-recessions compared to recessions. This can be 
explained by the special behaviour of this series during the analyzed period which is 
characterized by large annual increases of about 15% for 1966-73 and 8.5% for 1973-80, 
obscuring the effect of the business-cycle; and, a seasonal component independent 
of the business-cycle. 


Given the above observations, we can feel confident that the official seasonal adjustment 
procedure at Statistics Canada will give the best estimates among the alternatives considered
during recessions. 
Relational Patterns between Total Unemployment and 
Unemployment Insurance Beneficiaries in Canada 


ESTELA BEE DAGUM, GUY HUOT, NAZIRA GAIT, 
and NORMAND LANIEL¹ 


ABSTRACT 


This study purports to assess whether there are temporal relationships between Unemployment Insurance 
Beneficiaries, Total Unemployment, Job Losers and Job Leavers in Canada using univariate and 
multivariate time series methods. The results indicate that during 1975-82 the Unemployment Insurance 
Beneficiaries series leads: (1) Total Unemployment by one month and (2) Job Leavers by two months. 
On the other hand, there is evidence of a feedback relationship between Unemployment Insurance 
Beneficiaries and Job Losers. 


KEY WORDS: Job losers; Job leavers; ARIMA; VARMA; Multivariate time series. 


1. INTRODUCTION 


Unemployment Insurance (UI) plays a key role in helping the national labour markets 
adjust to trade and demand-induced changes in production and employment patterns. The 
main function of UI as part of labour market policy is to provide adequate financial protec- 
tion during temporary unemployment, to facilitate adjustments. By removing the immediate 
threat from unemployment, UI relieves job seekers of the need to yield to economic pressures 
by accepting jobs unsuited to their skills or abilities. It permits a more systematic or wide- 
ranging job search contributing to the efficient reallocation of human resources. Further- 
more, when there are temporary plant layoffs, the objective of UI is met by providing in- 
come protection to laid-off workers, so the employer keeps an experienced labour force in- 
tact. This saves him/her the cost of recruiting and training new employees after a layoff. 
It also saves the employee from going through extreme dislocation to prevent financial 
hardship. 

In any situation, UI must have enough flexibility to take into account prevailing economic 
circumstances which may limit the availability of other jobs and extend jobseekers’ 
unemployment. In the Canadian UI program, this flexibility is provided as longer benefit 
durations are triggered by rising regional unemployment rates. 

The gap between overall unemployment and the UI series tends to narrow in recession 
and widen in recovery periods. When business conditions worsen and layoffs occur, job 
losers become a greater proportion of Total Unemployment. As most Unemployment 
Insurance claimants are in fact job losers, this increases the proportion of Unemployment 
Insurance Beneficiaries related to Total Unemployment. 

This study purports to assess whether there is a temporal relationship between the 
Unemployment Insurance Beneficiaries and Total Unemployment in Canada. The analysis 
is extended to Job Losers (JLo) and Job Leavers (JLe) who can claim benefits and are 
the two major groups of Total Unemployment. The existence of strong relationships among 
these variables can be useful to explain labour market behaviour. Furthermore, they may 
lead to other types of similar relationships useful to estimate unemployment in small areas 


1 E.B. Dagum and G. Huot, Time Series Research and Analysis Division, Statistics Canada. N. Gait, University 
of Sao Paulo, Brazil, was visiting Statistics Canada when the paper was written, and N. Laniel, previously Time 
Series Research and Analysis Division, currently with Business Survey Methods Division, Statistics Canada. 
where the sample size of the current labour force survey is inadequate. Section 2 introduces 
the definition of each of the four series discussed and analyzes the main characteristics from 
their spectra. Section 3 estimates the residual cross-correlation values, for several time lags, 
of the whitened series to assess whether or not there are pairwise relationships and their direc- 
tion, if present. The residuals are computed from ARIMA models fitted to each series. Sec- 
tion 4 extends the previous analyses by identifying and estimating two multivariate time series 
models in order to understand the joint dynamic relationships of: (1) UIB and TU; and (2) 
UIB, JLo and JLe. Finally, Section 5 gives the main conclusions of this study. 


2. THE MAIN CHARACTERISTICS OF THE ANALYZED SERIES 


To understand the type of relationship between UIB and TU and its major components, 
JLo and JLe, we first introduce the definitions and analyze the main characteristics looking 
at their spectra. 


2.1 Total Unemployment (TU) 


The Labour Force Survey (LFS) Division of Statistics Canada obtains monthly information
through a sample of 56,000 representative households across the country. Although the
survey has been under development since 1952, substantial revisions were introduced to the
LFS from 1976. 

Estimates of employment, unemployment and non-labour force activity refer to the specific 
week covered by the survey each month, normally the week containing the 15th day. The 
sample is designed to represent all persons in the population 15 years of age and over, residing 
in Canada, with some minor exceptions. 

The Labour Force is composed of people who, during the reference week, were employed 
or unemployed. The employed includes persons who: 

- did any work at all; 

- had a job but were not at work due to illness or disability, bad weather, labour dispute, 
  vacation, personal or family responsibilities. 


The unemployed includes persons who: 

- were without work, but actively looked for work in the past four weeks and were available 
  for work; 

- had not actively looked for work in the past four weeks but had been on layoff for 
  26 weeks or less, and were available for work; 

- had not actively looked for work in the past four weeks but had a new job to start 
  in four weeks or less, and were available for work. 
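
The definitions above amount to a simple decision rule, which can be sketched as follows. The function and argument names are hypothetical and do not correspond to actual LFS questionnaire items; the residual category "not in labour force" is also an assumption of this sketch, since the text defines only the employed and the unemployed.

```python
def classify_lfs(worked, has_job_but_absent, looked_past_four_weeks,
                 available, weeks_on_layoff=None, weeks_until_new_job=None):
    """Classify a respondent for the reference week from the stated rules."""
    if worked or has_job_but_absent:
        return "employed"
    if available:
        if looked_past_four_weeks:
            return "unemployed"
        if weeks_on_layoff is not None and weeks_on_layoff <= 26:
            return "unemployed"
        if weeks_until_new_job is not None and weeks_until_new_job <= 4:
            return "unemployed"
    # Assumed residual category for everyone else.
    return "not in labour force"


print(classify_lfs(False, False, False, True, weeks_on_layoff=10))  # unemployed
print(classify_lfs(False, True, False, False))                      # employed
print(classify_lfs(False, False, True, False))                      # not in labour force
```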


Total unemployment is composed of the sum of job losers (JLo), job leavers (JLe), new 
entrants to the labour market, re-entrants after one year or less, and re-entrants after more
than one year (Statistics Canada 1976). Of these five components, the first two are the most
important for our study since they can claim benefits and represent about 70% of TU. 

Data on the flows into unemployment are not available prior to 1975. Thus, all the series 
were observed for the period January 1975 to December 1982, which includes the most recent 
data available at the time. 

Figure 1 shows the original Total Unemployment series which is characterized by a peak 
in the winter months and a trough in the summer. Figure 2 shows the spectrum of the Total 
Unemployment series. High power is observed at the frequency 0.05 cycle/month associated 
with the business-cycle (0.05 corresponds to a 20-month cycle). Similarly, relatively high 
power is observed at the fundamental seasonal frequency 0.083 cycle/month and neighbour- 
ing frequencies, but less at the harmonics of the fundamental seasonal. Finally, the contribu- 
tion of the irregular fluctuations to the total variance is small, relative to the other two 
components. 
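
The correspondence between the frequencies quoted above and cycle lengths is simply the reciprocal relation, period = 1/frequency. The sketch below, with hypothetical function names, converts the two frequencies mentioned in the text and lists the fundamental seasonal frequency of a monthly series together with its harmonics.

```python
def period_in_months(frequency):
    """Convert a frequency in cycles per month into a period in months."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / frequency


def seasonal_frequencies(months_per_year=12):
    """Fundamental seasonal frequency and its harmonics (cycles per month)."""
    return [k / months_per_year for k in range(1, months_per_year // 2 + 1)]


print(round(period_in_months(0.05), 1))   # business-cycle peak: 20.0 months
print(round(period_in_months(0.083), 1))  # fundamental seasonal: 12.0 months
print([round(f, 3) for f in seasonal_frequencies()])
```
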
Figure 1. Total Unemployment Series 
Figure 2. Spectrum of Total Unemployment 
Figure 3 shows the original Job Losers series and Figure 4 displays its corresponding
spectrum. Similar to TU, high power is shown at the business-cycle frequencies, but now most 
of the seasonal power is at the fundamental seasonal band and very little is left at the har- 
monic bands. The contribution of the irregular variations is smaller than that of TU. 
Figure 3. Job Losers Series 
Figure 4. Spectrum of Job Losers 
Figure 5 shows the Job Leavers series, characterized by two troughs, one in the winter 
months and the other during the summer. Its spectrum is given in Figure 6. This series has 
more cyclical variations than trend as indicated by the high peak at 0.022 cycle/month which 
corresponds to a 45-month cycle. Furthermore, the seasonal variations are highly
concentrated around the first harmonic band, supporting the fact that this series has two seasonal 
troughs. Finally, the contribution of the irregular to the total variance is larger than that 
observed for the two previous series. 


Figure 5. Job Leavers Series (vertical axis: thousands; horizontal axis: time, Dec. 75 to Dec. 82) 


Figure 6. Spectrum of Job Leavers (vertical axis: power spectrum; horizontal axis: frequency in cycles/month) 
2.2 The Unemployment Insurance Beneficiaries (UIB) 


The monthly data for Unemployment Insurance Beneficiaries cover all persons drawing 
benefits for a specific week, namely the week of the LFS. This is not a sample since it in- 
cludes the total population of beneficiaries. The UI covers virtually all paid workers in the 
labour force and members of Armed Forces. The main exceptions are: 

- People 65 years of age and over; 

- People working fewer than 15 hours weekly; 

- People earning less than 20% of the maximum weekly insurable earnings (in 1982, it 
  was $70). 


In order to qualify for benefits, a claimant must be available for and capable of work, 
unable to find suitable employment and meet the necessary qualifying requirements. Previously, 
eight weeks of work was the minimum required to qualify for benefits but as of December 
1977 this number varied between 10 and 14 weeks according to the rate of unemployment 
prevailing in the region of residence of the claimant. Benefits are paid after a two-week period 
has been served. 

Claimants who qualify for benefits can receive up to 25 percent of their benefits in earn- 
ings and continue to receive UI. However, the LFS considers these individuals to be employed. 
In order to assess the relationship between UI beneficiaries and unemployment, it is thus 
more accurate to use the series of UI beneficiaries without earnings. This subset of UI 
beneficiaries is a fairly consistent and significant proportion of the total LFS count of the 
unemployed. We must note, however, that because of differences in definition, the following
groups are counted as unemployed in the LFS but are not included in the UI records, 
namely, entrants and re-entrants; all individuals who have worked but not long enough to 
qualify for benefits; and those unemployed persons who were previously self-employed. On 
the other hand, persons insured under the UI program can receive benefits even though, under 
the LFS definition, they would not be classified as unemployed; examples include self-employed 
fishermen during the off-season, women on maternity leave and employees away from work 
due to sickness or disability. 

The UI beneficiaries (without earnings) series is a sensitive indicator of labour market 
economic conditions. It is reflective of the insured labour force with recent work experience. 

The original Unemployment Insurance Beneficiaries series, as shown in Figure 7, displays
large seasonal fluctuations with a peak during the winter months, when bad weather curtails
outdoor work in industries such as fishing, construction and lumber, bringing a sharp
rise in claims filed by affected workers.

Figure 8 shows the spectrum of the UIB series. Very high power is shown at the frequency
0.0167 cycle/month, which corresponds to a 60-month cycle, and at those frequencies
associated with the fundamental seasonal band. The contribution of seasonal variations to
the total variance of the series is much larger than that observed in TU and its two major
components. Finally, there is little irregularity relative to the trend-cycle and the seasonal
components.
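The link between a spectral frequency and a cycle length is period = 1/frequency, so 0.0167 cycle/month is a cycle of roughly 60 months and the fundamental seasonal frequency is 1/12 cycle/month. The sketch below illustrates this with a raw (unsmoothed) periodogram. That choice is an assumption, since the spectral estimator actually used for Figures 6 and 8 is not stated here, and the series is synthetic rather than the UIB data.

```python
import math


def period_in_months(frequency):
    """Convert a frequency in cycles/month into a cycle length in months."""
    return 1.0 / frequency


def periodogram(series):
    """Raw periodogram at the Fourier frequencies j/n, j = 1, ..., n // 2.

    Returns a list of (frequency, power) pairs for a mean-corrected series.
    """
    n = len(series)
    mean = sum(series) / n
    centred = [value - mean for value in series]
    result = []
    for j in range(1, n // 2 + 1):
        freq = j / n
        cos_sum = sum(v * math.cos(2 * math.pi * freq * t) for t, v in enumerate(centred))
        sin_sum = sum(v * math.sin(2 * math.pi * freq * t) for t, v in enumerate(centred))
        result.append((freq, (cos_sum ** 2 + sin_sum ** 2) / n))
    return result


# Synthetic monthly series (hypothetical): ten years with a pure 12-month cycle.
synthetic = [10 + 3 * math.sin(2 * math.pi * t / 12) for t in range(120)]
peak_frequency, _ = max(periodogram(synthetic), key=lambda pair: pair[1])
```

For the synthetic series the largest ordinate falls at the seasonal frequency, and `period_in_months(peak_frequency)` recovers the 12-month cycle. A smoothed estimator would normally be preferred for real data.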


3. PAIRWISE RELATIONSHIPS BETWEEN UNEMPLOYMENT 
INSURANCE BENEFICIARIES, TOTAL 
UNEMPLOYMENT, JOB LOSERS AND JOB LEAVERS 


Several early Canadian studies (e.g., Grubel et al. 1975; Green and Cousineau 1976; Jump
and Rea 1975; and Siedule et al. 1976) support the general conclusion that unemployment
has tended to shift upward with the increased availability of unemployment insurance in 1971.
Lazar (1978) shows that the 1971 changes increased the unemployment duration and induced
higher rates of job leaving, especially of young persons and adult women. These studies
were made before the changes of 1975 that aimed at strengthening work incentives. It was
expected that the changes introduced after 1975 would reverse the effects of the program
on total unemployment.

Figure 7. Unemployment Insurance Beneficiaries Series

Figure 8. Spectrum of Unemployment Insurance Beneficiaries

In this section, we carry out an exploratory analysis by searching for pairwise temporal 
relationships between Total Unemployment, Unemployment Insurance Beneficiaries, Job 
Losers and Job Leavers. The existence of these relationships will be useful to build a 
multivariate time series model to explain the joint dynamic behaviour of the above variables. 

The pairwise relationships between TU, UIB, JLo and JLe are calculated using the cross-
correlations of the residuals or innovations from ARIMA models (Box and Jenkins 1970)
that fitted the data well. It has been rightly argued by several authors (e.g., Pierce and Haugh
1977) that the cross-correlations between white noise residuals obtained with different filters
are biased toward accepting the null hypothesis of independence when the series are in fact
related. Pierce and Haugh (1977) suggest using dynamic regression models. This, however, implies that
we have to make a judgement on which variable is the cause and which is the effect. At this
stage, we are simply interested in determining whether there is a temporal relationship in
each pair of variables analyzed. Table 1 shows the ARIMA models fitted to each series, their
parameter values estimated with unconditional least squares, the results of the portmanteau
test (Ljung and Box 1978) and the residual variance.

The values of the Q statistic do not reject the null hypothesis of randomness of the residuals in each
case. However, since this test is applied to a set of autocorrelations of the residuals for various
lags, it is possible to have a significant autocorrelation at some particular time lag k that will
not be detected by this test. Therefore, we also tested whether there was autocorrelation of
the residuals at each time lag. For the variance of the autocorrelation we used an approximation
that is more accurate in small samples than 1/N, namely $(N - |k|)N^{-2}$, as given by
Haugh (1976).
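The two checks described above can be sketched as follows: the Ljung-Box portmanteau statistic Q = N(N + 2) Σ r_k² / (N − k) over lags 1 to m, and a lag-by-lag check that uses the variance (N − |k|)/N² in place of 1/N. The function names and the two-standard-error cut-off are assumptions made for illustration. The residuals would come from whatever ARIMA fit is being checked.

```python
import math


def autocorrelation(residuals, k):
    """Sample autocorrelation of the residuals at lag k (mean-corrected)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [value - mean for value in residuals]
    denominator = sum(d * d for d in dev)
    return sum(dev[t] * dev[t - k] for t in range(k, n)) / denominator


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic over lags 1..max_lag (compare with a chi-square table)."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


def lag_standard_error(n, k):
    """Square root of the small-sample variance approximation (N - |k|) / N**2."""
    return math.sqrt((n - abs(k)) / n ** 2)


def significant_lags(residuals, max_lag, multiple=2.0):
    """Lags whose autocorrelation exceeds `multiple` standard errors in absolute value."""
    n = len(residuals)
    return [
        k
        for k in range(1, max_lag + 1)
        if abs(autocorrelation(residuals, k)) > multiple * lag_standard_error(n, k)
    ]
```

A call such as `ljung_box_q(residuals, 24)` corresponds to the Q(24) column of Table 1, while `significant_lags(residuals, 24)` flags individual lags that the portmanteau test could miss.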

Having obtained satisfactory results from the above models we calculated the cross-
correlation $\hat{r}_{xy}(k)$ between the series analyzed. The $S_M^*$ statistic (Haugh 1976) is applied to
test the independence between the series. Under the assumption that the residuals are normally
distributed and that $E[\hat{r}_{xy}(k)] = 0$ and $\mathrm{Var}[\hat{r}_{xy}(k)] = (N - |k|)N^{-2}$, the statistic

$$S_M^* = N^2 \sum_{k=-M}^{M} (N - |k|)^{-1}\, \hat{r}_{xy}(k)^2$$

follows a $\chi^2$ distribution with 2M + 1 degrees of freedom. In order to determine the direction
of the pairwise relationships, we modified the $S_M^*$ statistic so that it is calculated for
positive or negative k only, excluding zero.
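A minimal sketch of this statistic, in its two-sided form and in the one-sided variants restricted to positive or negative lags, is given below. The sign convention (positive k means the first series leads the second) and the function names are assumptions chosen for the illustration. The inputs are meant to be the residual series from the univariate models.

```python
def cross_correlation(x_resid, y_resid, k):
    """Sample cross-correlation between x[t] and y[t + k] (mean-corrected).

    Assumed convention: k > 0 relates x to later values of y (x leads y).
    """
    n = len(x_resid)
    mean_x = sum(x_resid) / n
    mean_y = sum(y_resid) / n
    dx = [v - mean_x for v in x_resid]
    dy = [v - mean_y for v in y_resid]
    scale = (sum(v * v for v in dx) * sum(v * v for v in dy)) ** 0.5
    if k >= 0:
        numerator = sum(dx[t] * dy[t + k] for t in range(n - k))
    else:
        numerator = sum(dx[t] * dy[t + k] for t in range(-k, n))
    return numerator / scale


def s_star(x_resid, y_resid, lags):
    """N**2 times the sum over the given lags of r(k)**2 / (N - |k|)."""
    n = len(x_resid)
    return n ** 2 * sum(
        cross_correlation(x_resid, y_resid, k) ** 2 / (n - abs(k)) for k in lags
    )


def two_sided(x_resid, y_resid, m):
    """Lags -m..m, compared with chi-square on 2m + 1 degrees of freedom."""
    return s_star(x_resid, y_resid, range(-m, m + 1))


def one_sided(x_resid, y_resid, m, x_leads=True):
    """Lags 1..m (x leads) or -m..-1 (y leads), excluding lag zero."""
    lags = range(1, m + 1) if x_leads else range(-m, 0)
    return s_star(x_resid, y_resid, lags)
```

Comparing `one_sided(..., x_leads=True)` with `one_sided(..., x_leads=False)` indicates which direction carries the dependence. Each one-sided sum involves m lags, so under the same assumptions it would be referred to a chi-square distribution with m degrees of freedom.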
Table 2 presents the estimates of the cross-correlation between Unemployment Insurance
Beneficiaries (UIB) and Total Unemployment (TU) and its two major subcomponents Job
Losers (JLo) and Job Leavers (JLe). We indicate with (a) and (b) those values significant
at the 5% and 1% levels, respectively. In the case of UIB and JLo we calculated $S_M^*$ for positive
and negative values of k from ±1 to ±6 and from ±1 to ±2 to determine whether there
is a dominant unidirectional relationship. The results indicated that there is no dominant
direction between the two variables but a feedback process.
We can summarize the results from Table 2 as follows: 
(1) There is indication of a unidirectional relationship between UIB and TU such that 
UIB would lead TU by one month; 

(2) There is a feedback between UIB and JLo with a strong instantaneous relationship. 
Taking into consideration the time lag between the two variables, the feedback pro- 
cess seems to be initiated by JLo at lag 2. 


Survey Methodology, December 1985 


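Table 1

Univariate ARIMA Models

Series                                        ARIMA model                                                                                  Q(24)

Unemployment Insurance Beneficiaries (UIB)    $(1 - 0.68B)\,\Delta\Delta_{12} \log_{10} UIB_t = (1 - 0.80B^{12})\,a_t$                       11.55

Total Unemployment (TU)                       $(1 - 0.25B^{[?]})\,\Delta\Delta_{12} \log_{10} TU_t = (1 - 0.84B^{12})\,a_t$                   9.13

Job Losers (JLo)                              $(1 - 0.31B^{[?]})\,\Delta\Delta_{12} \log_{10} JLo_t = (1 - 0.67B^{12})\,a_t$                 15.78

Job Leavers (JLe)                             $(1 - 0.37B^{[?]})\,\Delta\Delta_{12} \log_{10} JLe_t = (1 - 0.40B - 0.25B^{[?]})(1 - 0.87B^{12})\,a_t$   14.58

[?] marks an exponent that is illegible in the source. Of the residual variance column only the
values 0.000395, 0.000604 and 0.000627 survive; their assignment to rows is not recoverable.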


Table 2 


Cross-Correlation Between Unemployment Insurance Beneficiaries and Total Unemployment 
and Its Two Major Components, Job Losers and Job Leavers 
(3) There is a unidirectional relationship between UIB and JLe such that UIB would lead
JLe by 2 months. We observe, however, the effect of a delayed feedback at lag 6 which
arises from the fact that the JLe series has a strong secondary peak in summer as
shown in Figure 6.

(4) Finally, there is a strong instantaneous and unidirectional relationship between JLo
and TU such that JLo would lead TU by 3 months.

The above observations lead to Diagram 1, which will be useful for the identification
of a more complex multivariate time series model that also takes into account the
partial associations among the variables.


Diagram 1

JLo --- 2 months ---> UIB --- 1 month ---> TU
JLo --- 3 months ---> TU
UIB --- 2 months ---> JLe


4. BUILDING A MULTIVARIATE TIME SERIES MODEL FOR 
UNEMPLOYMENT INSURANCE BENEFICIARIES, 
TOTAL UNEMPLOYMENT, JOB LOSERS AND JOB LEAVERS 


In the previous section we concluded that there are pairwise relationships among the four 
variables in the sense defined by Granger (1969) and Pierce and Haugh (1977). Taking into 
consideration those preliminary relationships, we here identify and estimate two multivariate 
time series models following the methodology developed by Tiao and Box (1981) and Tiao 
and Tsay (1983). These models will explain the joint dynamic behaviour of the variables 
involved. 

A vector ARMA model for seasonal series takes the form

$$\phi(B)\,\Phi(B^s)\,Z_t = \theta(B)\,\Theta(B^s)\,a_t \qquad (4.1)$$

where

$$\phi(B) = I - \phi_1 B - \dots - \phi_p B^p \qquad (4.2)$$

$$\Phi(B^s) = I - \Phi_1 B^s - \dots - \Phi_P B^{Ps} \qquad (4.3)$$

$$\theta(B) = I - \theta_1 B - \dots - \theta_q B^q \qquad (4.4)$$

$$\Theta(B^s) = I - \Theta_1 B^s - \dots - \Theta_Q B^{Qs} \qquad (4.5)$$

are matrix polynomials in B (the backshift operator, defined by $B^m Z_t = Z_{t-m}$), the
$\phi$'s, $\Phi$'s, $\theta$'s and $\Theta$'s are k × k matrices, s is the seasonal periodicity, $a_t$ is a sequence of
random shock vectors IID $N(0, \Sigma)$ and $Z_t$ is a vector of stationary time series.




In order to avoid a problem of multicollinearity between TU and JLo, two VARMA models
were specified, a VARMA $(1,2)(0,1)_{12}$ that relates Unemployment Insurance Beneficiaries
with Total Unemployment, and a VARMA $(2,6)(0,1)_{12}$ that relates UIB with Job Losers and
Job Leavers. These models were identified and estimated using the exact maximum likelihood
method in the Scientific Computing Associates program (Liu and Hudak 1983). The models
are fitted respectively to the original data transformed as follows:


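$$\begin{pmatrix} uib_t \\ tu_t \end{pmatrix} = (1 - B)(1 - B^{12}) \log \begin{pmatrix} UIB_t \\ TU_t \end{pmatrix} \qquad (4.6)$$

and,

$$\begin{pmatrix} uib_t \\ jlo_t \\ jle_t \end{pmatrix} = (1 - B)(1 - B^{12}) \log \begin{pmatrix} UIB_t \\ JLo_t \\ JLe_t \end{pmatrix} \qquad (4.7)$$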


Table 3 shows the parameter values of the VARMA $(1,2)(0,1)_{12}$ model, with the standard
errors of the estimates given in parentheses. (The estimated parameter values and the variance-
covariance matrix of the residuals shown in Table 3 cannot be compared with those of
the univariate models (Table 1) because the former result from fitting the model to the
standardized transformed data rather than the non-standardized data used for the
univariate models.) Examination of the pattern of the cross-correlations of the residuals in
Table 4 suggests that the model is adequate. A plus sign is used when the estimate
exceeds twice its standard error, a minus sign when it is below minus twice its standard
error, and a dot for a non-significant value based on the above criterion.
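The indicator-symbol summary can be sketched in a few lines. The function names are hypothetical, and the reading of "less than twice its standard error" as "below minus twice the standard error" is an interpretation of the criterion rather than something spelled out in the text.

```python
def indicator(estimate, std_error):
    """Return '+', '-' or '.' according to the twice-the-standard-error rule."""
    if estimate > 2 * std_error:
        return "+"
    if estimate < -2 * std_error:
        return "-"
    return "."


def indicator_matrix(estimates, std_errors):
    """Apply the rule element by element to a matrix given as a list of rows."""
    return [
        [indicator(value, error) for value, error in zip(value_row, error_row)]
        for value_row, error_row in zip(estimates, std_errors)
    ]
```

Replacing each residual cross-correlation matrix by its symbol matrix makes any remaining pattern across lags easy to spot; an adequate model should give mostly dots.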

Thus, the VARMA model for UIB and TU becomes,

$$uib_t = 0.669\,uib_{t-1} + a_{1t} - 0.794\,a_{1(t-12)} \qquad (4.8)$$

$$tu_t = 0.475\,uib_{t-1} - 0.347\,tu_{t-1} + a_{2t} - 0.308\,a_{2(t-2)} - 0.705\,a_{2(t-12)} + 0.217\,a_{2(t-14)} \qquad (4.9)$$
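The lag-14 term in (4.9) is the cross product of the regular and seasonal moving average factors, since $(1 - 0.308B^2)(1 - 0.705B^{12}) = 1 - 0.308B^2 - 0.705B^{12} + 0.217B^{14}$. The sketch below checks that product and simulates the two equations using the coefficients and residual covariance matrix as read from Table 3. The zero start-up values, the burn-in length and the function names are assumptions made for illustration, not part of the authors' procedure.

```python
import math
import random

# Coefficients as read from Table 3 (transformed, standardized series).
PHI_UIB_UIB, PHI_TU_UIB, PHI_TU_TU = 0.669, 0.475, -0.347
THETA_2_TU, SEASONAL_UIB, SEASONAL_TU = 0.308, 0.794, 0.705
# Lower triangle of the residual covariance matrix from Table 3.
S11, S21, S22 = 0.429249, 0.131532, 0.544389


def lag14_coefficient():
    """Cross term of (1 - 0.308 B^2)(1 - 0.705 B^12), about 0.217."""
    return THETA_2_TU * SEASONAL_TU


def lagged(sequence, t, k):
    """Value k steps before time t, or 0.0 before the start (assumed start-up)."""
    return sequence[t - k] if t - k >= 0 else 0.0


def simulate(n, seed=0, burn_in=100):
    """Simulate n observations of (uib, tu) from equations (4.8) and (4.9)."""
    rng = random.Random(seed)
    # Cholesky factor of the 2 x 2 covariance matrix, to correlate the shocks.
    l11 = math.sqrt(S11)
    l21 = S21 / l11
    l22 = math.sqrt(S22 - l21 ** 2)
    a1, a2, uib, tu = [], [], [], []
    for t in range(n + burn_in):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a1.append(l11 * e1)
        a2.append(l21 * e1 + l22 * e2)
        uib.append(
            PHI_UIB_UIB * lagged(uib, t, 1) + a1[t] - SEASONAL_UIB * lagged(a1, t, 12)
        )
        tu.append(
            PHI_TU_UIB * lagged(uib, t, 1) + PHI_TU_TU * lagged(tu, t, 1) + a2[t]
            - THETA_2_TU * lagged(a2, t, 2) - SEASONAL_TU * lagged(a2, t, 12)
            + lag14_coefficient() * lagged(a2, t, 14)
        )
    return uib[burn_in:], tu[burn_in:]
```

Simulated paths of this kind can be used to see how a shock to uib feeds into tu one month later, which is the lead relationship discussed in the text.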


Table 3
Estimated Parameters for the Transformed UIB and TU Variables

$\phi_1$:
    0.669 (0.089)          -
    0.475 (0.098)     -0.347 (0.115)

$\theta_2$:
         -                 -
         -             0.308 (0.116)

$\Theta_1$:
    0.794 (0.090)          -
         -             0.705 (0.086)

$\Sigma$:
    0.429249
    0.131532           0.544389
Table 4
Cross-Correlation Matrices of the Residuals in Terms of +, -, and ·


LAGS 1 THROUGH 6 


LAGS 7 THROUGH 12 


LAGS 13 THROUGH 18 


LAGS 19 THROUGH 24 


Equations (4.8) and (4.9) indicate that the Unemployment Insurance Beneficiaries series leads the Total
Unemployment series by one month. In fact, when analyzing the relationship between UIB
and TU we must keep in mind that an increase in JLo and thus an increase in UIB may lead
other members of the family to look for work in order to compensate for the loss of income.
These are the new entrants and re-entrants who do not qualify for insurance benefits but
contribute to an increase in TU. Furthermore, we should note that it is possible to have an
increase in Total Unemployment without an increase in the normal gross flow of labour markets,
simply because an increase in UIB occurs during recessionary periods where the availability
of jobs is significantly reduced and thus flows into the unemployment state will increase.

The results of this model are in agreement with the preliminary results obtained from the
pairwise cross-correlations of the previous section as shown in Diagram 1. The model,
however, provides us with more complete information on the dynamic behaviour of these
two phenomena. We observe that the Unemployment Insurance Beneficiaries series is positively
related to its previous-month level whereas Total Unemployment is positively related
to the previous-month level of UIB and negatively related to its own previous-month level. In
both equations, the effect of seasonality is reflected in the moving average part with a high
parameter value for the random shock at lag 12.

Table 5 shows the VARMA $(2,6)(0,1)_{12}$ model applied to the transformed UIB, JLo and
JLe variables as given in System (4.7).

Table 6 indicates no recognizable patterns in the estimated cross-correlation matrices of
the residuals and, therefore, this model is considered adequate.

The final vector ARMA $(2,6)(0,1)_{12}$ model for the three variables is,


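$$uib_t = 0.617\,uib_{t-1} + 0.268\,jlo_{t-2} + a_{1t} + 0.221\,a_{3(t-6)} - 0.831\,a_{1(t-12)} - 0.176\,a_{3(t-18)} \qquad (4.10)$$

$$jlo_t = 0.577\,uib_{t-1} - 0.285\,jlo_{t-1} + a_{2t} + 0.386\,a_{3(t-6)} - 0.525\,a_{2(t-12)} - 0.308\,a_{3(t-18)} \qquad (4.11)$$

$$jle_t = 0.303\,uib_{t-2} - 0.411\,jle_{t-1} - 0.403\,jle_{t-2} + a_{3t} - 0.797\,a_{3(t-12)} \qquad (4.12)$$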
Table 5
Estimated Parameters for the Transformed UIB, JLo and JLe Variables

$\phi_1$:
    0.617 (0.086)          -                  -
    0.577 (0.099)     -0.285 (0.096)          -
         -                 -             -0.411 (0.088)

$\phi_2$:
         -             0.268 (0.080)          -
         -                 -                  -
    0.303 (0.083)          -             -0.403 (0.084)

$\theta_6$:
         -                 -             -0.221 (0.087)
         -                 -             -0.386 (0.108)
         -                 -                  -

$\Theta_1$:
    0.831 (0.094)          -                  -
         -             0.525 (0.096)          -
         -                 -              0.797 (0.077)

$\Sigma$:
    0.339
    0.117              0.483
    0.014              0.153              0.428



Table 6 
Cross-Correlation Matrices of the Residuals in Terms of +, -, and ·


LAGS 1 THROUGH 6 


LAGS 7 THROUGH 12 


LAGS 13 THROUGH 18 


LAGS 19 THROUGH 24 
Equations (4.10) and (4.11) show the existence of feedback between Job Losers and
Unemployment Insurance Beneficiaries similar to the relationship found in Section 3. The
JLo series leads UIB by two months (equation 4.10) and the one-month lagged UIB strongly
affects the current value of JLo (equation 4.11). Furthermore, each of the two endogenous
variables UIB and JLo is affected by its previous-month level, positively in the case of
UIB and negatively in the case of JLo. The relationship between both series due to seasonality
is reflected by the parameter values of $a_{t-6}$ and $a_{t-12}$. The need for a moving average term
at lag 6 arises from the fact that the JLe series has a strong secondary peak in summer
as shown in Figure 6.

These empirical results are not in contradiction with economic theory. It has been argued,
with good reason, that causality cannot be detected from empirical evidence alone but must
be supported by economic theory (see e.g. Zellner 1979). It is easy to accept that an increase
in Job Losers which is associated with an economic recession will lead to an increase in
Unemployment Insurance Beneficiaries. In turn, an increase in Unemployment Insurance
Beneficiaries will lead to an increase in Job Losers because in reaction to a severe economic
recession, most firms make temporary layoffs to be able to have their employees back when
economic conditions improve.

Equation (4.12) raises an interesting question when showing that Unemployment Insurance 
Beneficiaries leads the Job Leavers by two months. It is not so evident why this should be 
the case. 


Plausible explanations can be found in the analysis of the short-run dynamics of the Canadian
labour markets and a thorough investigation would require longitudinal data. We can,
however, entertain the hypothesis among others that an increase in JLo and thus an increase 
in UIB may lead other members of the family to look for work in order to compensate for 
the loss of income. These persons are the new entrants and re-entrants. During a recession 
when JLo is increasing it is very difficult for new entrants and re-entrants to find a job. 
These new entrants and re-entrants are mainly young people and women over 25 who are 
willing to accept any job, at first, as long as it means extra income for the family. They 
might work for the length of time necessary for them to qualify for benefits. Then, once 
they qualify for benefits, they would become JLe in order to be more selective in the kind 
of job they will accept. 


5. CONCLUSIONS 


The main purpose of this study has been to assess whether there are temporal relation- 
ships between Unemployment Insurance Beneficiaries (UIB) and Total Unemployment (TU), 
Job Losers (JLo) and Job Leavers (JLe) by building dynamic multivariate time series models. 

We have first carried out an exploratory analysis by searching for pairwise temporal rela- 
tionships between TU, UIB, JLo and JLe in the sense defined by Granger (1969) and Pierce 
and Haugh (1977). Our results indicated the existence of relationships among the four variables 
involved. 

We have then identified and estimated two multivariate time series models following the 
methodology developed by Tiao and Box (1981) and Tiao and Tsay (1983). The results of 
the vector ARMA models agree with the preliminary results obtained from the pairwise cross- 
correlations of the residuals of the univariate ARIMA models. 

The first vector ARMA model shows that the UIB series leads TU by one month. UIB
is also positively related to its previous-month level whereas TU is negatively related to its own.

The second vector ARMA model shows that JLo leads UIB by two months with the existence
of a one-month feedback from UIB to JLo. Furthermore, UIB is positively affected
by its previous-month level while JLo is negatively affected by its own. It also shows that UIB leads JLe
by two months.
These empirical results based on data for 1975-82 are not in contradiction with economic
theory. Furthermore, they conform to those of earlier Canadian studies, based on data prior 
to 1975, which supported the general conclusions that the increased availability of unemploy- 
ment insurance induced higher rates of job leaving, especially of young persons and adult 
women and led to increased levels of unemployment. Hence, it seems that the UIC regula- 
tion change in 1977 had little effect, if any, in this regard. 

It would have been very interesting to assess the effect of the severe recession that started
in July 1981 but given the series length, elimination of this recessionary period would have
made the series too short for any sound statistical modelling.
Basic Principles of Questionnaire Design


LARRY SWAIN¹


ABSTRACT 


Thirty basic principles of questionnaire design are presented covering the content, wording, format, 
and testing of questionnaires. The extent to which the questionnaire is an integral part of the survey 
is emphasized as is consideration of its relationship with other aspects of survey design. 


KEY WORDS: Survey; Questionnaire; Methodology. 
1. INTRODUCTION 


Most surveys make use of a questionnaire which is to be completed by either a respondent 
or an official representative of a survey organization (by personal contact or telephone). Since 
the questionnaire is the means by which the objectives of a survey are transformed into 
measurable variables, successful achievement of those objectives requires an effective ques- 
tionnaire. In addition, the questionnaire may help structure, standardize, and control the 
data collection process so that the required information is obtained in a satisfactory man- 
ner. Effective questionnaire design is a combination of basic principles and common sense, 
adapted to the particular needs of each individual survey. 

Although thirty separate principles of questionnaire design are presented, they are not 
intended to be seen as independent of each other or of the survey environment in which they 
operate. The extent to which the questionnaire is an integral part of the survey process cannot
be sufficiently emphasized. As the questionnaire cannot be designed in isolation from
the various other aspects of the survey, the reader is also advised to consider during questionnaire
design its relationship with survey objectives, population, data collection, coding
and data capture, editing, imputation, confidentiality, and testing.

Since this paper is not intended to be a comprehensive discussion of either survey or ques- 
tionnaire design, alert readers, depending on their own perspectives of the various aspects 
of a survey, may identify omissions in the principles or may wish to exclude particular principles
as more appropriate to a survey component other than questionnaire design.

The basic principles of questionnaire design as presented cover the content, wording, format,
and testing of questionnaires. The questionnaire has a major impact on whether or not
the survey objectives are met. Unlike other major survey components such as sample design
or data processing procedures, the questionnaire directly involves the respondent. Therefore,
it is essential that the content, wording, and format ensure the collection of reliable, valid,
and relevant information from the respondent.

The author recognizes that although some of the principles appear obvious when stated, 
they are usually not so in practice. Also, some of the principles are measurable; some are not. 

In the principles which follow, the term questionnaire is consistently used to refer to the
various types of forms used to obtain information. In the literature and in practice, distinctions
are often made among:


(a) a questionnaire (completed by a respondent); 


¹ Larry Swain, formerly of the Census and Household Survey Methods Division, Statistics Canada; currently of
the Human Resources Planning Division, Public Service Commission, Ottawa, Canada K1A 0M7.


162 Swain: Questionnaire Design 


(b) an interview schedule (completed by an interviewer); 

(c) an administrative form (completed by a respondent or an official representative of the 
survey organization); 

(d) a form used to record observations or measurements (completed by an official represen- 
tative of the survey organization); 

(e) a form used when transcribing information from existing administrative records (com-
pleted by an official representative of the survey organization). 


For simplicity, the term questionnaire is used herein to represent all such forms. In addi- 
tion, the term questionnaire item is used to represent the particular question or statement 
requesting information, including the response categories or space for response. 

The term survey is used generally to represent any data collection activity, including sam- 
ple surveys, censuses, and administrative data collection. 


2. CONTENT 


1. All questionnaire items should be directly related to the objectives and uses 
of the survey. 


It is a reasonable goal that the collection of information be designed to minimize response 
burden by techniques such as reducing the number of questions. Exclusion of questionnaire items 
only remotely related to the objectives and uses of the survey is a means of satisfying this goal. 

In addition, questionnaire items that ask for irrelevant information unnecessarily contribute 
to the overall length of a questionnaire and may provoke suspicion in respondents, factors 
which may lead to increased non-response rates (a possible source of bias), to a poorer quality 
of data because of fatigue or lack of concentration by interviewers or respondents, and to 
increased costs, both financial and temporal, to the survey sponsor and to the respondents. 

For the questionnaire designer, the very act of relating each questionnaire item to the survey 
objectives and uses helps ensure that these objectives and uses are well defined and will indeed 
be satisfied by the questionnaire. 


2. If a questionnaire contains items that, although relevant to the survey, may 
not appear so to respondents, then an explanation of the reason for their in- 
clusion should be provided to respondents. 


Classification variables such as age, sex, marital status, size of organization, number of 
employees and variables such as name, address, and telephone number (used for follow-up 
procedures or for editing purposes) are possible examples where an explanation to respondents 
should be considered for inclusion (at least at a general level). 


3. Only those questionnaire items for which responses can be provided easily and 
with sufficient reliability should be included. 


Where information is requested through recall by respondents, the events should be suffi- 
ciently recent or familiar to the respondents; where the request can be satisfied from available 
records maintained by respondents, the effort (including both time and cost) required to obtain
the information should not exceed the benefits to be gained by acquisition of the information. 


Because of potential definitional ambiguities, increased response burden, and processing
errors, it may be advisable that respondents not be asked to process information to complete 
a questionnaire item. It may be easier and more accurate for respondents to be asked for the 
specific information already available to them, to be processed later by the survey organization. 
4. Respondents should not be asked questionnaire items for which they cannot
be expected to provide any response. 


Questionnaire items should not presume that the respondent has knowledge or awareness 
of a specific topic or engages in a particular activity. Filter questions can be used to exclude 
a respondent from a subsequent questionnaire item or sequence of items if those items are 
irrelevant because of the respondent’s own particular characteristics, circumstances, or 
opinions. 

Should respondents encounter many irrelevant items, they may feel that the survey questionnaire
has been given to them in error. This could contribute to non-response or to poor
relations with respondents.

The use of a filter question also serves to identify clearly whether or not a respondent 
is required to answer a subsequent questionnaire item or sequence of items. This is useful 
during survey processing and subsequent analysis. If a response to a questionnaire item is 
blank, it may be difficult to distinguish between the situation in which the reason for the 
blank is non-response (a refusal or an accidental omission), and that in which it is because
the question does not apply (in the case of a numerical answer, the question may seem not 
to apply when the answer is legitimately zero). A filter question helps resolve this problem 
by identifying which respondents should have answered the questionnaire item. 

Complex skip patterns, however, should be avoided, especially for those questionnaires 
completed by respondents themselves. Also, the number of filter questions should be 
minimized. 

For items requiring a numerical answer, an alternative to a filter question is the inclusion 
of a None category. 
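The role of a filter question at the processing stage can be illustrated with a small sketch. The function name, the answer codes ("yes", "no", None for a blank) and the category labels are all hypothetical and would depend on the survey's own coding scheme.

```python
def classify_item(filter_answer, item_answer):
    """Classify a follow-up item using the answer to its filter question.

    filter_answer: "yes", "no", or None when the filter itself is blank.
    item_answer:   the recorded answer, or None when the item is blank.
    """
    if filter_answer is None:
        # Without the filter, a blank item cannot be interpreted.
        return "undetermined"
    if filter_answer == "no":
        # The respondent was correctly skipped past the item.
        return "not applicable"
    if item_answer is None:
        # The respondent should have answered but did not.
        return "non-response"
    return "response"
```

With this rule a blank on the follow-up item is treated as non-response only when the filter shows that the item applied, and a legitimate zero recorded as an answer remains a response.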


3. WORDING 
1. The phrasing of a questionnaire item should be appropriate to the respondent. 


If a respondent does not understand a questionnaire item, it is probable that the response 
to that item will be inaccurate or not be given. Words, phraseology, and sentence structure 
familiar and appropriate to those providing the information should be used. 

Abbreviations should be avoided unless they are understood by respondents. 


2. Where there is sufficient demand, questionnaires should be translated into other 
languages. 


Steps should be taken to ensure that the translated version corresponds adequately to the
original version with respect to the intended meaning.


3. The questionnaire designer should choose the type(s) of questionnaire items 
most appropriate to obtain the required information while minimizing the 
response error and response burden in obtaining that information. 


The types of questionnaire items for consideration are the open-response or free-answer
type, the closed-response or fixed-answer type, and the fill-in-the-blanks type. Closed-response
types are those items for which answer categories are provided. Fill-in-the-blanks types,
although they appear to be open-response because no answer categories are explicitly provided,
are actually implicitly closed from the respondent's point of view in that the choice of answers
is usually limited to a number, a day of the week, a province, etc.

Generally, closed-response questions entail less respondent and/or interviewer burden, 
since they do not require respondents to formulate an answer in their own words nor do
the answers have to be recorded verbatim. 


4. When a decision among two or more well-defined alternatives is required, a
closed-response or fill-in-the-blanks type of questionnaire item should be used. 


When all the alternatives are too numerous to be listed, then the use of the category other 
to represent a number of infrequently occurring responses, the use of an open-response or 
fill-in-the-blanks type of questionnaire item, or the collapsing of alternatives into fewer 
categories is recommended. In fact, it may be appropriate to use a fill-in-the-blanks type, 
where the response categories and numerical codes are included in a separate instruction 
booklet accompanying the questionnaire. In business, agriculture and institutional surveys, 
fill-in-the-blanks types of items are frequently used for questions that require a numerical 
response. The choice and number of categories in a closed-response type depends on the com- 
plexity of interpretation of the concept, the uses to which the data will be put, and the prior 
information available to the questionnaire designer. 


5. When the alternatives to a question are not well-defined, an open-response 
type of questionnaire item should be used. 


Open-response types of questionnaire items are frequently used in preliminary research 
or exploratory studies to generate specific hypotheses and to structure items for subsequent 
questionnaires. The open-response type of questionnaire item may also be used as a means 
of probing for additional or qualifying information, for purposes of verification of other 
questionnaire items, for use in interpreting data, as a change of pace, or as an introduction 
to a new topic. 


6. If ease, timeliness, and cost of processing the data for capture are important 
considerations, closed-response types of questionnaire items should be used. 


Open-response types of questionnaire items require coding of the information provided, 
an operation which can be both costly and time-consuming and is also subject to errors of 
interpretation and procedure. 

In addition, with open-response types of questionnaire items, no specific frame of reference 
is provided, leading to the choice of varying frames of reference on the part of respondents. 
These varying frames of reference and the provision of varying amounts of information by 
respondents cause difficulty in the recording, coding and analysis of responses. On the other 
hand, a closed response provides a specific frame of reference, which although avoiding the 
above problems, may artificially induce a response. This is especially true when the respondent
has little or no information or opinion about a particular topic. The questionnaire
designer must therefore be aware of the possible frames of reference of respondents before 
choosing a type of questionnaire item. 

Once the type and wording of a questionnaire item have been decided upon, restrictions 
are placed on the uses to which the information can be put, the specific hypotheses which 
can be tested and the analyses that will be applied to the item. This implies that the deter- 
mination of objectives, uses, hypothesis testing and analyses is a prerequisite to the final 
version of the item. This determination does not preclude that the data will suggest addi- 
tional analyses and uses within the limits imposed by the questionnaire items themselves. 

In addition to the above considerations, the past experience of the questionnaire designer
will contribute to the choice of suitable type(s) of questionnaire items in particular situations. 


7. Response categories for closed-response types of questionnaire items should 
be non-overlapping and exhaustive (that is, mutually exclusive and 
comprehensive). 


The response categories of a particular questionnaire item should be distinct and include 
all possibilities. 

The distinctiveness of response categories does not preclude the applicability of more than 
one response to a particular questionnaire item. In such a case, a note such as check as many 
as apply should be included as part of the question. 

Where response categories are such that only one response is to be provided to a par- 
ticular questionnaire item, a note such as check one item only should be included as part 
of the question (except in the most obvious cases, for example, where the response categories
are Yes and No). In those cases where more than one response can be applicable but where 
the designer wishes that only one item be checked in order to restrict responses, a note such 
as check the most appropriate item should be provided. 
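
For numeric items, the requirement that categories be non-overlapping and exhaustive can be checked mechanically before the questionnaire is printed. The following is only a minimal sketch under stated assumptions: the function name `check_categories`, the inclusive integer bounds, and the age categories are all hypothetical illustrations, not part of the original guidelines.

```python
def check_categories(categories, low, high):
    """Return a list of problems found in integer response categories.

    categories: list of (label, first, last) with inclusive integer bounds.
    low, high: inclusive range of values the item is meant to cover.
    """
    problems = []
    ordered = sorted(categories, key=lambda c: c[1])
    expected = low  # next value that still needs a category
    for label, first, last in ordered:
        if first > expected:
            problems.append(f"gap: {expected}-{first - 1} not covered")
        elif first < expected:
            problems.append(
                f"overlap: '{label}' starts at {first}, "
                f"but values up to {expected - 1} are already covered"
            )
        expected = max(expected, last + 1)
    if expected <= high:
        problems.append(f"gap: {expected}-{high} not covered")
    return problems


# Hypothetical age categories (in years), for illustration only.
flawed = [("Under 18", 0, 18), ("18 to 34", 18, 34), ("35 to 64", 35, 64)]
fixed = [("Under 18", 0, 17), ("18 to 34", 18, 34),
         ("35 to 64", 35, 64), ("65 and over", 65, 120)]

print(check_categories(flawed, 0, 120))  # one overlap and one gap
print(check_categories(fixed, 0, 120))   # no problems reported
```

In the flawed set, age 18 falls in two categories and ages 65 and over fall in none; the corrected set is both mutually exclusive and comprehensive.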


8. The units of response should be specified. 


Either the units of response (e.g., kilograms, tons, per cent, hours per week) should be 
included in the questionnaire item or the respondent should be asked to specify them. Other- 
wise, there may be ambiguity as to which units were actually used. 


9. Standardized concepts and definitions should be used. 


To facilitate comparison of survey data with other sources of information (publications, 
other surveys) and to maximize the usefulness of the data (including secondary analysis), 
standardized (commonly understood and used) definitions should be used where they exist, 
and are well-defined, appropriate and up-to-date. Statistics Canada publishes standards related 
to occupational classes, industrial classes, commodities, geography and specific social con- 
cepts. In addition, Census concepts and categories are frequently used as standards. 


10. The wording of questionnaire items should be specific, definitive, consistent, 
brief, simple and self-explanatory. 


Survey concepts and terms that are new to respondents or subject to misinterpretation should
be explained, defined, or avoided. To ensure consistent interpretation, the proper frame of
reference (e.g., time reference, location, category of expenditure) should be provided. If an
inconsistency is required (e.g., different time references for different items), the change should
be highlighted in the questionnaire.

Where several words can be used interchangeably, one of these should be selected and 
used throughout the questionnaire. If a synonym of a word already encountered is used in 
its place, respondents and others may assume that a different meaning is intended. 


11. Double-barreled questions should be avoided. 


A double-barreled question allows the respondent to make only one response although 
it is actually two questions in one. From the response, it is not possible to discern which 
of the two ideas was answered or whether both were answered. The two issues should be
asked separately except in specific circumstances where two issues necessarily have to be asked
together to convey the proper meaning. In such a case, it should be made clear to respondents 
that the two issues are both to be considered together. 


12. Leading questionnaire items should be avoided. 


A leading questionnaire item is one that is worded or formatted in such a way as to in- 
duce a respondent to choose a particular alternative or set of alternatives. 

Some questions can be considered leading if they present options that respondents may
perceive as socially unacceptable, without assuring respondents that no stigma is attached
to their response.

In attitudinal surveys, two basic principles have evolved to reduce (but not necessarily 
eliminate) response bias. The distribution of alternative answers should balance to provide 
approximately as many positive answers as negative to avoid leading respondents in one direc- 
tion. Secondly, where there exists a series of items that have the same response alternatives, 
the sequence of items should either contain a mixture of positive and negative statements, 
be broken up, or be presented in a varied order to reduce the incidence of respondents answer- 
ing in the same manner throughout the sequence (even though it may be inappropriate), 
without thinking very carefully about the particular responses. 


4. FORMAT 


1. Every questionnaire or questionnaire package should contain explanatory in- 
troductory material. 


2. The introductory material should state the title of the survey, the name(s) of 
the sponsoring institution(s), and the purpose(s) of the survey.


3. An assurance to respondents of the confidentiality of the data that they pro- 
vide should be considered. 


4. The name (if appropriate) and telephone number or postal address of a con- 
tact within the sponsoring institution(s) should be included on the question- 
naire in order that respondents may obtain additional information related to 
the survey, should they require it. 


Generally, the introductory material may be in the form of a letter or brochure sent to 
the respondent; it might be a prepared statement made by an interviewer; or it can appear 
on the questionnaire itself. The introduction contains essential background information to 
respondents for the purposes of identification, legitimacy and notification of legal rights (if 
applicable). 


5. Suitable identification should appear on the questionnaire. 

For the purposes of estimation, field control, linkage with other records or follow-up on 
non-respondents, appropriate identification (numerical or otherwise) should be included on 
the questionnaire. 

6. Questionnaire items and pages should be numbered.


To facilitate administration by interviewers, completion by respondents, and coding operations
and instructions, questionnaire items and pages should be numbered consecutively (using
either letters or numbers) throughout the questionnaire. If questions are written on
both sides of a page, an instruction (e.g., over) should appear at the bottom of the first side
to ensure that the questions on the second side are completed.


7. The print on the questionnaire should be such that it can be easily read by 
the average respondent. 


The person completing the questionnaire must be considered when determining the size 
of the type face (for example, small print could cause problems for those with poor eyesight) 
and the colours and contrasts of paper and type to be used. It is usually advisable to have 
different type face (size or type of characters) used for questions and instructions so that 
they can be easily distinguished. 


8. Instructions for completion should be included on or with the questionnaire. 


To help ensure that the questionnaire is completed properly by respondents, interviewers 
or other officials, brief but clear instructions should appear on or with the questionnaire 
(e.g., in an interviewers’ manual or an instruction manual). However, questionnaire items 
should be as self-explanatory as possible to avoid complex sets of instructions. 

For questionnaires being read by an optical character reader, clear instructions should 
be provided to help ensure their proper completion. 

Instructions to respondents or interviewers for skipping items following filter questions 
should be sufficiently obvious and easy to follow. The use of arrows and directions may 
be appropriate. Complex skip patterns should be avoided, especially for questionnaires com- 
pleted by respondents themselves. 


9. The instructions for return procedures should be included on the questionnaire. 


For a questionnaire which is to be returned by mail, the name and address of the person 
(or organization) to whom it is to be returned should be included on the questionnaire itself. 
Introductory letters and return envelopes can easily be mislaid or separated from the main 
body of the questionnaire. 

The deadline by which respondents are to return completed questionnaires should also 
be stated. 

For a questionnaire which is to be picked up by a field representative, space for the name 
and telephone number or postal address where the representative can be contacted and the 
date and approximate time of pick-up should be included on the questionnaire. 


10. The numerical fields and codes used for data capture purposes should appear 
on the questionnaire (when capture is to be directly from the questionnaire). 


When appropriate, data may be captured more quickly with fewer errors directly from 
the questionnaire itself. In such a case, the numerical fields and codes should be easily read 
by those performing the data capture but should not be a distraction to the respondent, in- 
terviewer or other official completing the questionnaire. 

When data are to be coded before data capture, the coding boxes may appear on the ques- 
tionnaire or on a separate sheet. When coding boxes do appear on the questionnaire, they 
should be clearly distinguished from answer boxes, perhaps with the Office Use Only designa- 
tion or through appropriate shading. 

Coding and data capture are often considered as steps that follow questionnaire design. 
It is essential for efficient implementation that they be considered during questionnaire design. 


11. The format of answer spaces should be consistent throughout the questionnaire,
with sufficient spacing for purposes of readability and accommodation
of the responses to the questionnaire items.


Consistency of layout for response facilitates the task of a respondent, interviewer or of- 
ficial and aids in reducing error caused by inadvertent omission of a questionnaire item, an 
incorrect response, or a transposition of responses. 

It may be useful to use different shapes for check-off type answers and numerical answers. 
One convention sometimes used is circles for the former and boxes for the latter. 

There should be generous spacing on the questionnaire: to facilitate administration; to 
make the questionnaire more attractive and readable; and to provide the respondent, inter- 
viewer or official with sufficient space for the response to the questionnaire item. 


12. Questionnaire items should be sequenced in a logical order for ease of com- 
pletion and to provide the proper frame of reference. 


The sequence of questionnaire items should appear logical to the respondent (a logic that 
may be different from that of the questionnaire designer), with questionnaire items related 
to one another grouped together. One sometimes recommended method is to have questions 
proceed from the most general questions to the most specific. Question ordering should try 
to anticipate the order in which respondents will supply information. The questionnaire
designer should recognize that a question may prompt an answer not only to that question 
but also to another question which (hopefully) follows very shortly. 

Transitions between sections of questions should be smooth. Section headings or introductory
statements to sections should be used. For questionnaires used in transcription from other
documents, a logical sequence would be that of the source document.

In attitudinal surveys, the questionnaire designer should avoid conditioning respondents 
in the early questioning to a frame of reference which could bias responses to later ques- 
tions. For example, questionnaire items regarding the awareness of a concept should precede 
any other mention of that concept. Sensitive questions should be placed within the context 
of related questions so as to justify their inclusion as much as possible and desensitize the 
questions somewhat. 


13. The final version of the questionnaire should contain no typographical or 
grammatical errors. 


The inclusion of errors on the questionnaire may have an adverse effect on data quality 
in that the questionnaire may not be treated seriously or may be misunderstood by those 
completing it. In addition, errors may contribute negatively to the image of the survey 
organization in the eyes of the public. 


5. TESTING 


1. Questionnaires administered for the first time or containing substantial 
modifications should be tested prior to their use as a collection document. 


Even when all the principles described in the previous sections have been followed, there
is no guarantee that the proposed questionnaire will fully satisfy the objectives of the survey,
no matter how conscientious the researcher has been in designing the questionnaire. There
are almost always unforeseen problems that occur in the administration of a questionnaire.
As a result, it is essential that a pretest of the questionnaire be implemented for all new surveys
and for already existing surveys on which substantial modifications have been made in order 
to determine whether the objectives are likely to be met by the proposed questionnaire. 
Some aspects of the questionnaire that the designer may test are the following: the wording, 
sequence and layout of the questionnaire to determine whether the questions and their flow 
are understood by respondents and interviewers; the necessity for inclusion of particular ques- 
tions; the choice of types of questions; the use of specialized questioning techniques such 
as ranking or rating questions; the structure and definition of response categories; the degree 
of usage of the ‘‘other’’ category in questions; the ease of administration of the question- 
naire; the time to administer various sections of the questionnaire; translation of the ques- 
tionnaire; the possibility of bias in the questions; the nature of ethnic, regional or linguistic 
differences; the reasonableness of the questionnaire with respect to its demands on the respondent;
the suitability of the questionnaire for measuring the concepts on which measurement
is required; letters of introduction or introductory procedures; and the suitability of the method
of collection.


A pretest should be done on at least a small sample of respondents (usually twenty to 
thirty) from the target population. It is preferable that the respondents be selected from the 
various subpopulations of the target population where differences or problems are likely to 
occur. Possible variables for definition of the test subpopulations are geographic region, educa- 
tional background, age, sex, language, size of firm and type of industry. Depending on the 
particular purposes of the pretest, either a probability or a non-probability sampling scheme 
may be required for the selection of respondents, although in most cases, the latter is employed. 
One possibility is to use a focus group discussion of the questionnaire as a part of the pretest 
procedure. 

The method of collection used for the pretest should be identical to that planned for the 
main survey. However, a personal interview is recommended for at least a portion of the 
pretest respondents so that the interviewer can then record the respondents’ reactions, both 
verbal and non-verbal, as well as their own suggestions and impressions. After each test
interview, the interviewer can discuss difficulties that the respondent had, the interpretation
of questions and response categories, and so on. These difficulties can then be discussed with
the designer of the questionnaire, for example, in the context of a meeting among the ques- 
tionnaire designer and the pretest interviewers to debrief them on the interviews. For some 
pretests, it may be preferable to use experienced, skilled interviewers in order to maximize 
the usefulness of the pretest. 

The pretest is an often-neglected procedure. It will almost always suggest improvements 
or will at least give the designer some assurance that the questionnaire used in the main survey, 
a much more expensive proposition, will likely proceed fairly efficiently. Of course, there 
is never any guarantee that all problems will be solved, but most major ones should be. A 
pretest need not be expensive and need not require a great deal of time for implementation
and is recommended for all new or modified questionnaires.


Selected Administrative Data Files¹


ABSTRACT


Twelve administrative data files are reviewed to determine if some of them could be used to derive 
migration data, in case the universality of the currently used family allowance files be limited, as a 
result of federal legislation. 

It is found that none of the twelve files have strengths and weaknesses strictly comparable to those 
of the family allowance files. Further developments of the Health Care, and to a lesser extent the Old 
Age Security files are highly recommended. 


KEY WORDS: Administrative files; migration; qualitative evaluation. 


1. INTRODUCTION 


In Canada, both family allowance and income tax files have a wide range of utility in
producing the migration and population estimates for the different geographic areas (see
Statistics Canada Catalogue Nos. 91-001, 91-210, 91-211 and 91-212). Data from the family
allowance files are made available within 2 to 3 months after the reference date. In contrast,
income tax data are available within 12 to 15 months after the reference date. However, income
tax data provide the estimates of migration flows for the census divisions, and also
by age and sex.

In terms of accuracy of population estimates, both family allowance and income tax files
are good and they are comparable (see Norris and Standish 1983; Norris 1983; Verma et
al. 1984; Verma and Basavarajappa 1985). One of the special features of the family allowance
and income tax files is the fact that they are national in character. Another feature is that
the records contain addresses with the postal codes. Thus, they could provide migration
information for local areas. However, in recent years, there seems to be some possibility
that family allowance could cease to be universal as a result of government legislation. For
example, coverage might be limited to the lower and middle sectors of the population. If
this file ceased to be universal, its utility as a migration data source would be very severely
limited. Hence, our population estimation activities would be jeopardized, which in turn would
affect other programs such as revenue sharing, involving the annual distribution of $20 billion
among provinces.

For this reason, alternate sources need to be explored. An attempt is made here to assess
the strengths and weaknesses of some of the selected administrative data files for estimating
migration and population for provinces and territories, census divisions, census metropolitan
areas and other regions in Canada.

The twelve administrative files are qualitatively evaluated as alternatives to family
allowance files. On the basis of their strengths and weaknesses, they are divided into the
following three groups:


¹ Abridged version of the paper presented at the meetings of the Federal-Provincial Committee on Demography
held on November 28-29, 1985, Ottawa, Canada.

² Ravi B.P. Verma and Pierre Parent, Demography Division, Census and Demographic Statistics Branch, Statistics
Canada, 4th floor, Jean Talon Building, Tunney’s Pasture, Ottawa, Ontario, Canada K1A 0T6.

Major potential files for estimating migration flows


i) Health Insurance Files 
ii) Old Age Security File 


Major potential files used as a symptomatic indicator of population change and net 
migration 


iii) Hydro Connections 
iv) Telephone Customers 
v) School Enrollments 


Other files with limited or uncertain potential for estimating migration flow/net 
migration 


vi) Driver’s License 
vii) Building Permits
viii) Unemployment Insurance Beneficiaries 
ix) Labor Force Survey 
x) Voters’ List 
xi) Retail Sales 
xii) Trucking Statistics 


1.1 Criteria for Evaluating Administrative Data Files 


The assessment of the usefulness of the various administrative data sources for estimating 
interprovincial and intraprovincial migration is done with respect to ten criteria: universe, 
coverage, method of determining migration information, types of migration, characteristics 
of records, reference date/period (and monthly availability), time-lag, historical availabili- 
ty, consistency and computerization (Almond 1982). 

The new data source would have high potential if it contains features of the family
allowance files, as described in Table 1. The most important criteria are: coverage, timeliness, 
consistency, monthly or quarterly availability, disaggregation using the postal code or other 
geocodes. The file or set of files that can meet these standards would probably qualify as 
replacement source to family allowance. 


2. MAJOR POTENTIAL FILES FOR ESTIMATING MIGRATION FLOWS 


Health Insurance and Old Age Security files are major potential files for estimating migra- 
tion flows among provinces, territories and census divisions. Strengths and weaknesses of 
each of these two files are presented below. 


2.1 Health Insurance File 


Health Insurance is a provincial responsibility. Each province thus keeps a file of people 
eligible for the program. All residents in the province (including newly arrived immigrants 
and foreign students) are covered by the provincial insurance, except for RCMP and Armed
Forces personnel and federal penitentiary inmates, who are covered by the federal government.
Everyone who establishes residence in a province must fill out an application,
from which data on in-migrants, by province of origin, and on international immigrants
can be compiled. Virtually complete coverage, monthly availability, minimal time lags and
information usually detailed by age, sex and family composition of the migrants are the main 
strengths of the files. There should also be a very strong incentive for interprovincial migrants 
to apply to the program. Consequently, migration data should be reliable. 
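
The compilation of in-migrants by province of origin described above amounts to tallying application records into an origin-destination table. The following minimal sketch illustrates the idea; the record layout, the field names `previous_province` and `province`, and the sample records are hypothetical and do not describe any actual provincial file.

```python
from collections import Counter


def compile_flows(applications):
    """Tally interprovincial moves as (origin, destination) -> count."""
    flows = Counter()
    for record in applications:
        origin = record["previous_province"]
        destination = record["province"]
        if origin != destination:  # ignore moves within the same province
            flows[(origin, destination)] += 1
    return flows


def net_migration(flows, province):
    """In-migrants minus out-migrants for one province."""
    arrivals = sum(n for (origin, dest), n in flows.items() if dest == province)
    departures = sum(n for (origin, dest), n in flows.items() if origin == province)
    return arrivals - departures


# Hypothetical application records; field names are illustrative only.
applications = [
    {"previous_province": "Sask.", "province": "Alta."},
    {"previous_province": "Sask.", "province": "Alta."},
    {"previous_province": "Alta.", "province": "B.C."},
    {"previous_province": "Man.", "province": "Man."},
]

flows = compile_flows(applications)
print(flows[("Sask.", "Alta.")])      # moves from Sask. to Alta.
print(net_migration(flows, "Alta."))  # arrivals minus departures for Alta.
```

Because each province holds only its own applications, out-migration from a province can be derived in this way only when the files of all destination provinces are available.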


Table 1

Description of the Administrative Data Files Currently
Used to Derive Migration Data in Canada

F.A. Monthly Statistics Report

  Universe: Children in payment of F.A.
  Coverage: 25% of total population in 1984. Virtually 100% of children aged 0-17.
  Method for determining status: Compilation of change of address notices.
  Types of migration: Interprovincial migration, by province of origin and destination.
  Characteristics: Origin-destination. Age: total 0-17 only. Family size: refers to the number of children in family.
  Reference date/period: Month: refers to the amount of information processed during that time.
  Time-lags: Data processed in a given month are available at the end of that month and refer to migration of approximately two months earlier.
  Historical availability: January 1974 onwards for children migration data. From 1947 to 1973, only information on family migration was available.
  Consistency: OVER TIME: change in 1974. Slight problems since 1980. AMONG PROVINCES: good.
  Level of computerization: In provincial offices, yes. But data are sent to Health and Welfare Canada central office on print-outs.

F.A. M0024 File

  Universe: Children entitled to F.A. (as opposed to "in payment").
  Coverage: Similar to F.A. Monthly Statistics.
  Method for determining status: Compilation of change of address notices.
  Types of migration: Similar to F.A. Monthly Statistics, plus international migration.
  Characteristics: Origin-destination. Age: year and month of birth. Language (E or F). Type of account (regular, foster, foreign or agency).
  Reference date/period: Month: refers to month of real migration.
  Time-lags: Data released semi-annually. Contains information on last six months' migration. Available approximately 3 months after end of semi-annual version.
  Historical availability: December 1977 to present.
  Consistency: OVER TIME: generally good. Slight problems since 1982. AMONG PROVINCES: problems with Ont. since 1982. Slight problems with Nfld. and N.S. in 1983.
  Level of computerization: Yes, well developed.

Note: F.A. is an abbreviation for Family Allowance.


Table 1

Description of the Administrative Data Files Currently
Used to Derive Migration Data in Canada (Concluded)

F55 Program

  Universe: Children entitled to F.A.
  Coverage: Similar to F.A. Monthly Statistics.
  Method for determining status: Symptomatic indicator.
  Types of migration: Net migration.
  Characteristics: Number of children by geographical area. Age.
  Reference date/period: Twice a year, as of June 1 and December 1 (refers to the number of children entitled to F.A. as of these dates).
  Time-lags: Available approximately three months after reference date.
  Historical availability: December 1977 to present, with entitlement information. Available back to 1974 for children in payment.
  Consistency: Generally good.
  Level of computerization: Yes, well developed.

Revenue Canada File

  Universe: Tax filers (must have filed two consecutive years).
  Coverage: Filers matched two consecutive years total up to approximately 75% of population aged 18 and over.
  Method for determining status: Comparison of the return address of matched returns. Correction is brought for unmatched returns.
  Types of migration: Intraprovincial, interprovincial and international.
  Characteristics: Origin-destination. Broad age-sex group.
  Reference date/period: Year: refers to the period between two consecutive filings, i.e. approximately the April-March period. Used as June-May data.
  Time-lags: Preliminary available 6-8 months after end of reference period. Final data, 10-12 months.
  Historical availability: 1966-67 to present.
  Consistency: Changes in tax laws result in changes in coverage and in number of matched returns over time and provinces.
  Level of computerization: Yes, well developed.

There are also, however, certain weaknesses. The fundamental limitation is that neither
Ontario nor Quebec can provide migration data. In the latter case, however, new developments
are promising, but for Ontario, nothing is expected. Unless a special source is derived for
Ontario, this would compromise the high potential of this file. There could also be a consistency
problem, since each province independently administers its file.

At the subprovincial level, migration could also be derived since the Provincial Health
Care offices should be informed of any change of address. In practice, however, not all changes
are known.

Health Care files could also be used in regression estimates, especially in provinces that 
run periodic address checks to clean the file and count only the desired population. 


2.2 Old Age Security Records 


Health and Welfare Canada is responsible for the administration of the Old Age Security 
file. Canadian residents aged 65 and over who have accumulated a sufficient number of years of residence
in the country are eligible. They represent approximately 10% of the total population. Coverage
among eligible people is virtually universal. Also, the financial incentive to report a change
of address is very strong. Another strength of the file is its timely availability. Information 
on people moving in a given month is compiled and received by Statistics Canada two or 
three months later. Finally, Old Age Security, being a federal program, provides comparable 
data for the provinces; even if the information is compiled by provincial regional offices, 
they all follow the same procedure. 

The main shortcoming of this file for migration estimation purposes is that it refers
to a small portion of the population (varying from 7.3% in Alberta to 12.2% in P.E.I.);
moreover, the elderly show a rather different migration pattern than the rest of the population.
Unlike child migration, which can obviously be related to adult migration and then
be blown up to estimate total migration, no similarly efficient method could be developed to
estimate total migration from the Old Age Security file. Although it could not be used
as the main source for migration estimates, this file could provide a very interesting
estimate of the migration of the elderly.
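
The contrast with child migration can be made concrete. The text does not give the inflation formula, so the sketch below simply assumes a ratio factor taken from a benchmark source in which both child and total migration are observed; the function name, the choice of benchmark and all figures are hypothetical.

```python
def inflate_child_migration(child_migrants, benchmark_total, benchmark_children):
    """Blow up a count of child migrants to an estimate of total migrants.

    The factor is the ratio of total migrants to child migrants observed
    in a benchmark source (an assumption made for this sketch).
    """
    if benchmark_children <= 0:
        raise ValueError("benchmark_children must be positive")
    factor = benchmark_total / benchmark_children
    return child_migrants * factor


# Hypothetical figures: 3,000 child migrants among 9,000 migrants in the
# benchmark, and 1,200 child migrants counted in the current period.
estimate = inflate_child_migration(1200, benchmark_total=9000, benchmark_children=3000)
print(estimate)  # 3600.0
```

Such a factor is only useful when the subgroup moves with the rest of the population, as children do with their parents. The elderly, with their distinct migration pattern, offer no comparably stable ratio, which is the limitation noted above.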


3. MAJOR FILES USED AS SYMPTOMATIC INDICATORS 
OF POPULATION CHANGE AND OF NET MIGRATION 


Data from some administrative files could be useful for generating total population 
estimates. For example, School Enrolments, Hydro Connections or Telephone Residential 
Customers could be used in regression techniques as symptomatic indicators (see McRae 1985 
for an application of Hydro Connections to population estimates). This method and the corresponding
sources are generally used for producing small area population estimates, but
if no other technique gives valuable estimates at the provincial level, these sources will be
seriously considered.
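
The use of a symptomatic indicator in a regression can be sketched as follows. This is an illustration only, not the method of McRae (1985): it assumes a simple linear regression of the population growth ratio on the growth ratio of residential hydro connections, fitted over areas where both are known for a past period, and every number below is invented.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_population(base_population, indicator_ratio, intercept, slope):
    """Apply the fitted relation to a base (e.g. census) population count."""
    return base_population * (intercept + slope * indicator_ratio)


# Hypothetical calibration data for four areas over a past period:
# growth ratio of hydro connections and growth ratio of population.
connection_ratio = [1.00, 1.10, 1.20, 1.30]
population_ratio = [1.00, 1.08, 1.16, 1.24]

intercept, slope = fit_line(connection_ratio, population_ratio)

# Hypothetical area: base population 50,000, connections up by 5%.
print(round(estimate_population(50000, 1.05, intercept, slope)))
```

In practice several indicators (school enrolments, telephone customers) could enter the regression together; a single indicator is used here to keep the sketch short.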


3.1 Hydro Connections 


Electric companies keep files of their customers. Information on the type of account 
(residential, commercial, farm, ...) and the address and postal code of the customer are 
available. Coverage of residential households is virtually complete. Sometimes there is only 
one file for the province, but sometimes 2 companies (Manitoba and Newfoundland) or even 
more (B.C. and Ontario) provide the electric facilities within the province. In most provinces 
data can be produced for the entire territory, as of any date and within a short time-lag, 
but for a few provinces it can be hard to get the data. The main weaknesses of the files are
of two kinds. In addition to the previously cited problem, there may also be slight inconsistencies
due to the difference between provincial definitions of residential households (since
these follow administrative criteria), and even within one province, if more than one company
is involved. Nevertheless, Hydro Connections could be a very good source for population
estimates. As a matter of fact, they were tested in British Columbia, where population
estimates for municipalities and school districts were produced. The results were good. This 
method could also be tested and eventually be extended to provincial level estimates, if need be. 


3.2 Telephone Companies 


In Canada, telephone services are provided by 14 major telephone companies. Information
on customers with residential lines (address and postal code) is available. The situation is
roughly similar to that of the Hydro Connections files. Data can usually be obtained for 
specified dates within a rather small delay and the coverage is fairly high. Here again, more 
than one company may serve a given province, and also, a company may serve more than 
one province. Although no estimate based on Telephone files has been tested at Statistics
Canada, it is felt that they have the potential to produce good results.


3.3 School Enrollments 


Each provincial government maintains a computerized file on students enrolled in its 
primary and secondary school system, containing information on school addresses with the 
postal code and on the number of students, by age and grade. Information on the number 
of students refers to September 30 and is available between 4 and 10 months after the reference 
date, the time-lag varying by province. The coverage of students is also very good. 

There are some weaknesses associated with this file. For example, its annual character
works against its use for producing quarterly estimates. Its reference date (September
30 instead of June 1), along with the delay of up to 10 months, is another handicap. Finally,
at the subprovincial level, it can often be observed that some students reside in a given
administrative region but go to school in a different one. This could also affect the quality
of the estimates. It should be pointed out here that the school enrolment data were, at one time,
used in Statistics Canada (and also by the U.S. Bureau of the Census, using a component
method developed by them; see U.S. Bureau of the Census 1973, Chap. 23):
the deviations associated with that method were much higher than those with other methods.
If no other file could provide adequate population estimates, regression estimates with
that file could produce acceptable results, at the provincial level at least.


4. FILES WITH LIMITED POTENTIAL 
4.1 Driver’s License 


Each province maintains a file listing persons aged 15 (or 16, or 17) and over licensed
to operate a motor vehicle. Using the provincial files, migration could be estimated in two
ways: 1) by compiling changes of drivers’ addresses to estimate migration flows; and
2) by using the variation in the number of people licensed in a given region as a symptomatic
indicator of population change. Currently, Ontario uses drivers’ licenses to estimate
intraprovincial migration, but very few other provinces could provide migration flow
information, especially at the subprovincial level. Doing so would require too much
work and consultation with the provincial ministry. Although drivers are required
by law to report their change of address, not all do so, and no sufficiently detailed statistics
are available.

The driver’s license file could also be used in regression techniques. The availability of data
at any specified date in many cases, and the short delays, are positive points. However, coverage and
consistency concerns might affect the quality of the data. For example, 83% of adults in
Saskatchewan hold a driver’s license, as against 73% in Manitoba, where 85% of males and 62%
of females hold one. In addition, the poor,
comparatively recent immigrants, and Indians and residents of remote communities in the
North have below-average rates of license holding (Stock 1981, p. 44). For estimation
purposes, it is often preferable to have 100% coverage of a small subpopulation (e.g. children)
than 80-85% coverage of a large subpopulation (e.g. adults), especially if the coverage
is selective with respect to migration. Although incomplete coverage does not necessarily make
a file inappropriate for estimation purposes, it limits its potential.


4.2 Building Permits 


Statistics Canada collects new building permits for cities and rural areas in Canada. On 
average, the coverage rates vary between the urban (98.5%) and the rural (62.5%) areas. 
The building permit data are available on a monthly basis at the census division level. These 
data could be also used as a symptomatic indicator of the population change. However, one 
of the weaknesses of the building permit data is the fact that they refer to the date of permit. 
Due to this, it is not certain whether the building has been constructed and also, whether 
it has been occupied. Another weakness is the fact that the number of permits issued is not 
necessarily directly related to population change, especially in the case of a decreasing 
population. 

Thus, the use of building permit data also seems to be limited in estimating population 
for the different geographic areas. 


4.3 Unemployment Insurance Commission 


The Unemployment Insurance Commission keeps a list of the beneficiaries of the
program. A 10% sample of this file has been developed to produce statistics, and it could
provide migration information. However, this file could hardly be used to estimate migration
in Canada. First, a 10% sample of the unemployed corresponds to less than 1% of the
population. Flows of migrants between provinces could not be derived from such a small
subpopulation. Second, the sample is not representative (young adults account for a good part
of the non-employed), which casts doubt on the migration data from that file.

The Commission also maintains a file of wage earners who are contributing to the
Unemployment Insurance program. However, no in-depth analysis of this file has been done.


4.4 Labour Force Survey 


In 1982, Statistics Canada conducted a sample survey of 56,000 households in Canada.
Members of the civilian non-institutionalized population aged 15 and over who were included
in the sample, residing in all provinces, were asked a question on their migration history over
the past 5 or 6 years. Other valuable information is also available. However, its very small
sample (approx. 4% of the population) and the fact that the survey was conducted only once
eliminate the Labour Force Survey as a source of migration estimates.


4.5 Voters’ List


Data on voters are generally available in Canada. Federal and provincial election lists could
easily be obtained, while obtaining municipal lists would require more work. These lists
give information on the number of Canadian citizens aged 18 and over (landed immigrants are
included at the municipal level only). They cover on average 90-95% of the target population.
The main shortcoming of this source is that it is not available at regular intervals. Federal
and provincial lists are compiled for elections about every 4 years, at dates that are not useful
for estimation purposes. It thus seems pointless to consider voters’ lists.


4.6 Retail Sales


Data on retail sales are collected by Statistics Canada on the basis of sales figures from
large stores and from a sample of smaller businesses. These data are collected on a monthly
basis and are made available 3 months after the reference date. They could be
used as a symptomatic indicator of population change. However, the usefulness of this data
set seems to be limited for population and migration estimation, because
retail sales are heavily affected by economic fluctuations, which may
not accurately reflect changes in the size of the population.


4.7 Trucking Statistics (Moving Companies) 


Statistics on a sample of five major moving companies are available in Canada. They cover
about 90% of all moves. The interprovincial migration flow could be assessed by weighting
the number of reported moves between two different provinces/territories. However, trucking
statistics are seriously affected by a time-lag of two years or more.


5. CONCLUDING REMARKS 


In this report, an overview of the strengths and weaknesses of twelve administrative data files
has been presented in order to make recommendations for selecting an alternative data source
to the family allowance files. It has been found that no file has strengths and
weaknesses strictly comparable to those of the family allowance files. However, if the family
allowance files cease to be universal, the following recommendations could be made:


- Continue further developments in the use of the provincial health insurance file and
the Old Age Security records of the federal government in order to produce the total
population and migration estimates on a quarterly basis;


- Examine the quality of annual population estimates for the provinces and territories,
produced by the Component Method II using the migration estimates from the provincial
school enrollment data files; and


- Test the accuracy of the provincial administrative data files (health insurance files, hydro
connections, telephone companies and driver’s licence) as symptomatic indicators of
the population change and the residual net migrants for sub-provincial areas (census
divisions and census metropolitan areas in Canada).


Use of Administrative Data Files for Migration Estimates:
A Case Study of Driver’s Licence File in Ontario¹


RAGHUBAR D. SHARMA and CHEUK WONG²


ABSTRACT 


In Canada, provincial and federal demographers have attempted to use various sets of administrative
data to estimate migration flows. This paper presents the development of intra-provincial migration
estimates using driver’s licence data in Ontario. These migration estimates have been evaluated
by comparing them with those derived from the income tax data by Statistics Canada. Both files
provide equally good and complementary estimates of intra-provincial migration.


KEY WORDS: Administrative files; Population estimates; Component method; Small areas; Error 
of closure; Intraprovincial migration. 


1. INTRODUCTION 


Migration is an important component of population projections and population estimates.
As no records of population movement are kept in Canada, demographers
in the federal and provincial governments have attempted to use various sets of administrative
data to estimate migration flows. Statistics Canada uses revenue data (Norris and Standish
1983), British Columbia utilizes hydro-hookups (McRae 1985), and Alberta uses health
care records (Alberta Bureau of Statistics 1985). Since 1979, Ontario has been using drivers’
licence address changes to estimate intra-provincial migration. Apart from its quality,
one major attraction of the driver’s licence data is its timeliness. There is only a 4 to
5 week lapse between the date of reference and receipt of the data, compared with over
one and one-half years for revenue data. In the U.S.A.,
the State of California also uses driver’s licence address changes to estimate migration
within the state (Hoag 1984). In this paper we present an evaluation of estimates
of intra-provincial migration derived from the driver’s licence data in Ontario.


2. DRIVER’S LICENCE DATA FILE 


Information on driver’s licence address changes is made available by the Ontario Ministry
of Transportation and Communications (MTC). A driver is required to notify the
MTC within 90 days of his/her change of address.
The information is available at the postal code area level. These postal code areas
can be converted into subprovincial areas such as counties, regions and municipalities. As
Table 1 indicates, data are available for the past seven years. Since 1979, data are also available
for each quarter of these years.

More than a million changes of addresses are recorded every year. The majority of these 
moves tend to be within census divisions (that is, county or regional municipality). However, 
net inter-county movers averaged only about 22,000 per year. Table 1 indicates that about 
one-third of the records do not provide a postal code for either origin and/or destination 
of the mover. 


¹ Abridged version of a paper presented at the meetings of The Federal-Provincial Committee on Demography,
November 28-29, 1985, Statistics Canada, Ottawa.

² Raghubar D. Sharma and Cheuk Wong, Sectoral and Regional Policy Branch, Ontario Ministry of Treasury and
Economics, Queen’s Park, Toronto, Ontario M7A 1Y9.

In Ontario, a person becomes eligible to hold a driver’s licence at the age of 16 years.
More than 75 per cent of the eligible population holds a driver’s licence. The elderly
and women are much less likely to hold a driver’s licence (Table 2).


3. CONVERSION OF DRIVERS TO MIGRANTS 


An adjustment factor is applied to the number of drivers to arrive at the number of movers.
This adjustment factor (F) is calculated as follows:


$$\mathrm{FA} = \frac{\text{Known and Unknown Movements}}{\text{Known Movements}}$$

$$\mathrm{FB} = \frac{\text{Total Population}}{\text{Population with a Licence}}$$

$$F = \mathrm{FA} \times \mathrm{FB}$$


Table 1


Number of Total Movers and Number of Movers with
Unstated Origin and/or Destination, Ontario, 1975-1985


                        No. of Known Movers      No. of Origin and/or                  %
Year                    (Inter & Intra County)   Destination Unstated    Total         Unstated

1979 (Calendar Year)    881,000                  0                       881,000       0
1979/80                 586,000                  301,000                 887,000       34
1980/81                 566,000                  306,000                 872,000       35
1981/82                 617,000                  270,000                 887,000       30
1982/83                 648,000                  259,000                 907,000       29
1983/84                 822,000                  320,000                 1,142,000     28
1984/85                 831,000                  330,000                 1,161,000     28


Source: Ontario Ministry of Transportation and Communications.


Table 2
Percent of Population Holding Driver’s Licence, Ontario, 1981


             % of Population Holding
             a Driver’s Licence, 1981

Age          Male           Female     Total

16-19        63.9           36.5       49.1
20-24        92.7           73.0       85.7
25-34        98.6           81.6       90.0
35-44        99.7           79.9       90.0
45-54        96.7           67.8       82.4
55-64        93.2           56.7       74.2
65+          [illegible]    27.4       46.4
Total        90.6           2.2        75.8


Source: Ontario Ministry of Transportation and Communications.

FA accounts for the unstated origins and/or destinations and FB accounts for persons who do
not hold a driver’s licence. The factor assumes that the migration patterns of those who do not
hold a driver’s licence do not differ from those of licence holders. Similarly, it assumes that
the migration patterns of those with unstated movements do not differ from those whose movements
are stated.
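
The arithmetic behind the adjustment factor can be illustrated with a minimal sketch. The function name and the choice of inputs are assumptions made here for illustration: the 1981/82 movement counts come from Table 1 and the overall 1981 licence-holding rate (75.8%) from Table 2, but the paper does not state which figures were actually paired when F was computed.

```python
def adjustment_factor(known_moves, unknown_moves, licence_rate):
    """Return (FA, FB, F) for converting driver address changes to movers.

    licence_rate is the share of the population holding a licence (0-1).
    Hypothetical helper, not the authors' own program.
    """
    if known_moves <= 0 or not 0 < licence_rate <= 1:
        raise ValueError("known_moves must be positive and licence_rate in (0, 1]")
    # FA inflates known movements to cover unstated origin and/or destination.
    fa = (known_moves + unknown_moves) / known_moves
    # FB is total population divided by population with a licence.
    fb = 1.0 / licence_rate
    return fa, fb, fa * fb


# Illustration only: 1981/82 counts (Table 1), 1981 licence rate (Table 2).
fa, fb, f = adjustment_factor(617_000, 270_000, 0.758)
print(f"FA = {fa:.3f}, FB = {fb:.3f}, F = {f:.3f}")
```

Under these assumed inputs each recorded driver move with known origin and destination stands for a little under two movers, which shows how strongly the estimates depend on the two "no difference in migration pattern" assumptions.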


4. INTRA-PROVINCIAL MIGRATION ESTIMATES: 
DRIVER’S LICENCE VERSUS INCOME TAX FILES 


Statistics Canada uses the change of address provided by a taxpayer on his annual income
tax return. The number of children is estimated from the number of dependents claimed
by the taxpayer. As with the driver’s licence data, adjustment factors have to be applied
to the revenue data to account for unstated postal codes and for people who do not file an income
tax return. Furthermore, some taxpayers use a non-residential mailing address on their return.

The relative accuracy of migration estimates derived from the income tax file and driver’s 
licence file needs to be tested. Three measures have been applied to test this relative accuracy 
of the two data sets. They are: 


A. Errors of Closure 
B. Growth Rates Test 
C. Index of Dissimilarity 


Ideally, errors of closure and growth rates should be calculated from the population
estimates from one census year to the next. Reliable data on driver’s licence address changes
are available only from 1979 onwards in Ontario. Therefore, the 1979 intercensal population
estimates of Statistics Canada and the estimated 1981 population were used as the base. Two sets
of population estimates were calculated, the first using the driver’s licence address file for
intra-provincial migration and the second using the income tax file for intra-provincial migration.
All other components, i.e., births, deaths, interprovincial migration and international
migration, were kept the same for both sets of population estimates.


4.1 Errors of Closure 


Two sets of population estimates for the census divisions (one using driver’s licence data
and the other using income tax data for intra-provincial migration) were compared with
the 1981 census population. The percent difference between the estimated population and the
census population is called the error of closure. Out of 49 census divisions, 23 have smaller
errors if driver’s licence data are used and 26 have smaller errors if income
tax data are used to estimate intra-provincial migration.
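
A minimal sketch of the comparison described above follows. The sign convention (estimate minus census, as a percentage of census) is an assumption, chosen so that an under-estimate gives a negative error; the function names are hypothetical.

```python
def error_of_closure(estimate, census):
    """Percent difference of an estimated population from the census count."""
    if census <= 0:
        raise ValueError("census count must be positive")
    return 100.0 * (estimate - census) / census


def smaller_error_counts(errors_a, errors_b):
    """Count areas where source A or source B has the smaller absolute error.

    Returns (a_smaller, b_smaller, ties); both lists are in the same area order.
    """
    pairs = list(zip(errors_a, errors_b))
    a_smaller = sum(1 for a, b in pairs if abs(a) < abs(b))
    b_smaller = sum(1 for a, b in pairs if abs(b) < abs(a))
    return a_smaller, b_smaller, len(pairs) - a_smaller - b_smaller


# Made-up populations, for illustration only.
print(error_of_closure(99_000, 100_000))
```

Counting areas by the smaller absolute error, as `smaller_error_counts` does, is the tally reported in the text (23 against 26 census divisions); it says nothing about how large the differences are.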


4.2 Toronto Urban Complex 


An interesting picture emerges in the Toronto Urban Complex, which includes six
regional municipalities (Table 3). Driver’s licence data yield a smaller error of closure for
the complex as a whole but under-estimate the population for the areas outside Metro
Toronto.

The income tax file gives lower errors of closure for individual census divisions within
the complex, whereas for the complex as a whole its error is larger than that of the driver’s
licence file (Table 3). Accordingly, driver’s licence data were used for estimating intra-provincial
migration for the Toronto complex as a whole, and the distribution to individual census divisions
of the complex was based on revenue data.


4.3 Population Growth Rates


Percent change in the population estimates from 1979 to 1981 was calculated for the 
estimates derived by using the driver’s licence file and the income tax file respectively. These 
growth rates were compared with the 1981 census growth rates. 


Table 3
Errors of Closure for Toronto Urban Complex


                                    Errors of Closure

Census Division                   Income Tax    Driver’s Licence

Durham R.M.                         -0.21           -0.41
Halton R.M.                          0.45           -0.51
Hamilton-Wentworth R.M.              0.10           -0.11
Peel R.M.                           -0.63           -1.94
Toronto R.M.                         0.01            1.26
York R.M.                           -0.11           -5.70
Total Toronto Urban Complex         -0.05           -0.01


There is not much difference between the two sets in how close their growth rates are to
the census growth rates. However, the estimated direction of population
change differs from that of the census in 3 census divisions for the driver’s licence data and
in 10 for the income tax file (Table 4). This is one aspect where the driver’s licence data appear
to yield more reliable estimates than the revenue data.
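
The direction test can be sketched as a simple sign comparison. The figures below are the driver’s licence panel of Table 4 (percent change, estimate versus census); the function name is hypothetical, and an area with exactly zero change in either series is treated as agreeing, which is an assumption.

```python
def direction_mismatches(estimated_change, census_change):
    """Return areas whose estimated change has the opposite sign to the census change."""
    return [
        area
        for area, est in estimated_change.items()
        if est * census_change[area] < 0  # opposite signs give a negative product
    ]


# Percent change 1979-1981, driver's licence estimates versus census (Table 4).
driver_licence = {
    "Leeds and Grenville": 0.02,
    "Parry Sound": 0.81,
    "Thunder Bay": 0.42,
    "Province": 1.60,
}
census = {
    "Leeds and Grenville": -0.32,
    "Parry Sound": -0.51,
    "Thunder Bay": -0.20,
    "Province": 1.46,
}
print(direction_mismatches(driver_licence, census))
```

The provincial row is not flagged because both series show growth, even though the magnitudes differ.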


4.4 Index of Dissimilarity 


The index of dissimilarity was calculated for in- and out-migration separately, as the direction
of net migration was not the same for some counties for the two sets of estimates. The
value of the index of dissimilarity can vary between 0 and 100. It is half of the sum of
the absolute differences between the two corresponding percent distributions and is equivalent
to the sum of the positive differences or the sum of the negative differences (Shryock and
Siegel 1971). The general formula is:


$$ID = \tfrac{1}{2} \sum_i \left| r_{2i} - r_{1i} \right|$$


where $r_{1i}$ and $r_{2i}$ are the corresponding percentages in the two distributions.
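
As a minimal sketch of this calculation, the function below converts two sets of counts (for example, in-migrants by census division from each file) to percent distributions and takes half the sum of absolute differences. The function name and the sample counts are illustrative assumptions, not data from the paper.

```python
def index_of_dissimilarity(counts_a, counts_b):
    """Half the sum of absolute differences between two percent distributions (0-100)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both distributions must cover the same areas")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each distribution must have a positive total")
    pct_a = [100.0 * c / total_a for c in counts_a]
    pct_b = [100.0 * c / total_b for c in counts_b]
    return 0.5 * sum(abs(a - b) for a, b in zip(pct_a, pct_b))


# Made-up counts for two areas: 50/50 split versus 75/25 split.
print(index_of_dissimilarity([50, 50], [75, 25]))
```

Because both inputs are rescaled to percentages, the index compares only the geographic distribution of migrants, not the total number of migrants each file produces.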


The low values of the index indicate that both files (the driver’s licence and the income
tax) yield quite similar estimates of intra-provincial migration for the census divisions of
Ontario. However, over the four years the extent of dissimilarity increases for out-migration
and decreases for in-migration (Table 5).


5. CONCLUSION AND SUMMARY 


This study attempts to compare the intra-provincial migration estimates derived from the 
driver’s licence file with those derived from the income tax file. Both files provide reasonably 
good measures of the magnitude of intra-provincial migration for the Census Divisions of 
Ontario. 

Although the driver’s licence data appeared to provide better estimates in the direction 
of intra-provincial migration, the income tax data resulted in slightly more counties with 
smaller errors of closure and in addition yielded somewhat better results in some major areas 
(for example, distribution within the Toronto/Hamilton urban complex). In view of their 
respective strengths, the appropriate approach is to combine the use of these two data sources. 

Another point to note is that the evaluation was based on three years only,
i.e., 1979 to 1981. A more accurate assessment of the quality of these two data files cannot
be made until the 1986 census data become available.

To further improve the quality and the applications of the driver’s licence data, the following
two areas are suggested for further research:


• Verification of the FA factor based on actual counts of unknown origin/destination through
manual coding of addresses.

• Extension of the use of the driver’s licence data file as an additional source to family
allowance and revenue data for inter-provincial migration estimates.

The driver’s licence file tends to over-estimate migrants for Metro Toronto and under- 
estimate for the areas surrounding Metro Toronto. The reverse seems true for the income 
tax file. For this region as a whole, the driver’s licence file gives better estimates for intra- 
provincial migration than the income tax file. The income tax file provides a better distribu- 
tion of intra-provincial migrants in the counties of this region. 


Table 4


Census Divisions Which Yield Different Direction of
Population Change Than Those Based on Census


                                   % Change Based On

Census Division                 Income Tax      Census

Bruce                              0.11          -1.77
Grey                              -0.04           0.17
Hastings                           0.12          -1.03
Leeds and Grenville                0.21          -0.32
Niagara                            0.15          -0.46
Northumberland                     0.31          -0.82
Oxford                             0.17          -0.09
Parry Sound                        1.20          -0.51
Stormont/Dundas/Glengarry          0.44          -0.17
Sudbury T.D.                       0.41          -2.28
Province                           1.60           1.46


                                   % Change Based On

Census Division              Driver’s Licence   Census

Leeds and Grenville                0.02          -0.32
Parry Sound                        0.81          -0.51
Thunder Bay                        0.42          -0.20
Province                           1.60           1.46


Table 5
Index of Dissimilarity


                Index of Dissimilarity

Year         In-Migration    Out-Migration

1979-80         5.61            3.50
1980-81         5.54            3.82
1981-82         5.20            4.82
1982-83         4.41            4.87


The Development of Alberta Health Care Records
and Their Application to Small-Area Population Estimates¹


ABSTRACT


This paper examines the use of administrative files from Alberta’s Health Care Insurance Plans com- 
bined with Vital Statistics data as inputs for estimating population. Results, which are presented and 
compared with Census data, indicate that Health Care data can be used to produce accurate popula- 
tion estimates at the provincial level and for smaller areas such as census divisions and municipalities. 


KEY WORDS: Administrative files; Component method; Small areas; Residual net migration. 
1. BACKGROUND 


During the mid to late 1970’s, the Province of Alberta experienced rapid economic growth 
led by activity in the oil and gas industry, which generated high population growth. Govern- 
ments, in order to effectively provide goods and services for the influx of people into various 
regions, required timely data on where and by how much population was growing. With 
the need for up-to-date population data, it was felt that the federal quinquennial census was 
not sufficiently frequent nor current (census data are released about twelve to eighteen months 
after the reference year). Consequently, provincial agencies, and in particular, the Alberta 
Bureau of Statistics, began investigating alternative sources of timely population data. 

After examining a number of potential sources, the Bureau began assessing administrative
health care insurance data from the Alberta Health Care Insurance Plan (AHCIP) files to
develop population statistics. The remainder of this paper highlights work undertaken by
the Bureau to develop the AHCIP records and to use the data in estimating small-area
population.


2. DEVELOPMENT OF AHCIP RECORDS INTO HEALTH CARE COUNTS 


This section describes briefly the nature of the AHCIP records and evaluates the counts 
developed. 


2.1 Developing Health Care Counts Data 


The Bureau receives selected registration records via computer tape, on a quarterly basis,
from the AHCIP registration-billing system. (The tape contains only a partial listing; in
particular, all names, identifiers, etc. have been stripped such that the confidentiality of all
individuals is strictly preserved.) The file contains information such as addresses, postal codes,
registration and cancellation dates, age and sex for every registrant. (A detailed description
of the record layout is available upon request.)


¹ Abridged version of the paper presented at the Federal-Provincial Committee on Demography meeting held on
November 26-27, 1985, Ottawa, Canada.


² K. Ahmad, R. Chow, O. DeVries, A. Hashmi and M. Marcogliese, Alberta Bureau of Statistics, Alberta Treasury,
Sir Frederick W. Haultain Building, 9811-109th Street, Edmonton, Alberta, Canada T5K 0C8.

The reporting unit of the AHCIP file is the registration. Each registration may contain
up to twenty-five individuals: one registrant (usually the person who pays the premiums)
and up to twenty-four dependents. There are currently about 1.7 million active registrations
accounting for roughly 2.6 million individuals. In addition, the file is historical and includes
all individuals ever covered under AHCIP since its inception in 1969.

The file is processed through four phases.


a) Edit: notes and/or corrects errors according to edit check criteria.

b) Purge: uses the edited raw data file and selects active individuals.

c) Consolidation: matches postal codes between the purged file and the Bureau’s Postal
Code Translator File (PCTF) and attaches the geographic reference information to the
AHCIP records.

d) Aggregation: takes the consolidated file and aggregates males and females by single
years of age for each postal code. This reduces the number of records/individuals from
approximately 2.6 million to fewer than 120,000 and significantly reduces the subsequent
systems processing costs.

The aggregated file is used for the production of age and sex counts by any geographic
area definable through the 60,000 PCTF Alberta codes.
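
The four phases can be pictured with a minimal sketch. Everything in it is an assumption for illustration: the record fields, the edit rules, the `"Unknown"` label for unmatched codes, and the one-entry translator table are hypothetical and are not taken from the Bureau’s actual system or record layout.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    postal_code: str
    age: int
    sex: str          # "M" or "F"
    cancelled: bool   # True if coverage has been cancelled


def edit(records):
    """Phase a: normalise postal codes and drop records failing basic checks."""
    cleaned = []
    for r in records:
        code = r.postal_code.replace(" ", "").upper()
        if r.sex in ("M", "F") and 0 <= r.age <= 120:
            cleaned.append(Person(code, r.age, r.sex, r.cancelled))
    return cleaned


def purge(records):
    """Phase b: keep active individuals only."""
    return [r for r in records if not r.cancelled]


def consolidate(records, pctf):
    """Phase c: attach a geographic area from the translator table."""
    return [(pctf.get(r.postal_code, "Unknown"), r) for r in records]


def aggregate(consolidated):
    """Phase d: count individuals by area, postal code, sex and single year of age."""
    return Counter((area, r.postal_code, r.sex, r.age) for area, r in consolidated)


# Hypothetical input: four individuals and a one-entry translator table.
records = [
    Person("t5k 0c8", 34, "F", False),
    Person("T5K0C8", 34, "F", False),
    Person("T2P1J9", 70, "M", True),
    Person("XXXXXX", 5, "M", False),
]
pctf = {"T5K0C8": "CD 11"}
counts = aggregate(consolidate(purge(edit(records)), pctf))
print(counts)
```

Aggregating to one count per postal code, sex and age is what shrinks the file: many individuals collapse into a single keyed count, and any geography built from postal codes can then be summed from those counts.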


2.2 Evaluation of the Counts Data 


To evaluate the health care counts data, Census of Canada population figures for 1976
and 1981 were used for comparison. The 1981 AHCIP records were considered to be more
accurate than the 1976 file; therefore, the evaluation relied more heavily upon the 1981 census
comparisons. Municipal census data were used as a second basis of comparison,
even though these data generally were not considered to be as reliable as Canada Census
figures. The municipal censuses, however, provided insight into the magnitude of the variations
as well as the relative distributions of age, sex and trends (growth or decline) over time.
An additional source of comparison was intercensal population estimates prepared by the
Bureau and by Statistics Canada.


Basic findings:

a) On a provincial basis, AHCIP counts overestimate both Canada Census and total
municipal census figures by about 3.5% to 4.5%. Age and sex distributions are more
accurate and the correlation coefficients indicate consistency of trends (over/under
estimates) over time.


b) At the census division (CD) level, AHCIP counts varied from Canada Census figures 
from —2.6% to 9.7% (see Table 1). Comparisons with intercensal population estimates 
indicated a similar variance. As with the provincial level data, age and sex distribu- 
tions and the trend consistency proved highly reliable. 


c) At the census consolidated subdivision (CCSD) level, for fifty of the seventy-one CCSDs,
health care data were within ±10% of the Census counts. The largest discrepancy was
-56.5% (Municipal District 135).


Most problem areas had major urban centres located close to the county, municipal 
district and improvement district boundaries. No specific anomalies were found when 
testing the age and sex distributions, although relationships were not as strong as with 
the province and the census division levels. 


d) At the census subdivision (CSD) level, preliminary figures showed that discrepancies between
the AHCIP counts and 1981 Census data ranged from -100% to +955%.
Consequently, only the twenty-eight largest areas, those with populations of over 5,000, were
used at the CSD level. The six largest CSDs (Edmonton, Calgary, Lethbridge, Medicine Hat,
Red Deer and St. Albert) displayed overcounts ranging from 3% to 9%. Eight other
CSDs differed by up to ±20%, while sixteen showed variation somewhat greater than ±20%.
Again, no specific age and sex distribution anomalies were detected, although
discrepancies were greater than those at more aggregated levels. As well, twenty-seven
of the twenty-eight CSDs indicated high trend consistency.


As the geographic area decreases in size, AHCIP counts become less reliable; age and
sex distributions, although less accurate, still remain strong; and trend consistency (counts
over time) remains high, with a few notable exceptions. The limitations of AHCIP
counts as population indicators can be attributed primarily to one of two main sources: a)
the AHCIP administrative procedures/inaccuracies; or b) use of postal codes.


a) AHCIP Administrative Procedures: 


1) As an insurance programme, a chief concern is to supply coverage. Therefore, efforts 
are directed to getting people onto the system to ensure universal coverage with less 
effort placed on getting individuals off the system. This has resulted in more people 
being registered than are actually in the province. 


Table 1


Comparisons of Alberta Health Care Counts and Canada Census Data
for Alberta Census Divisions


                              1976                                               1981

Census      Census      AHCIP       Percent     Actual       Census      AHCIP       Percent     Actual
Division    Count       Count       Difference  Difference   Count       Count       Difference  Difference

1           46,990      45,789      -2.56       -1,201       55,375      55,748      0.67        373
2           96,995      97,229      0.24        234          110,477     111,567     0.99        1,090
3           32,898      33,884      3.00        986          35,652      36,463      2.27        811
4           12,130      12,101      -0.24       -29          12,119      12,038      -0.67       -81
5           35,424      35,656      0.65        232          38,382      38,457      0.20        75
6           524,554     538,432     2.65        13,878       668,682     699,999     4.68        31,317
7           37,866      38,235      0.97        369          40,071      40,359      0.72        288
8           95,384      95,063      -0.34       -321         123,642     124,666     0.83        1,024
9           19,903      21,832      9.69        1,929        21,670      23,338      7.70        1,668
10          67,171      67,168      0.00        -3           78,417      78,532      0.15        115
11          632,909     646,799     2.19        13,890       762,041     796,884     4.57        34,843
12          63,129      62,011      -1.77       -1,118       84,221      86,183      2.33        1,962
13          46,305      47,258      2.06        953          53,701      54,282      1.08        581
14          19,386      21,039      8.53        1,653        24,635      25,991      5.50        1,356
15          106,993     111,678     4.38        4,685        128,639     134,451     4.52        5,812
Unknown*                48,462                                           19,279

Alberta     1,838,037   1,922,636   4.60        84,599       2,237,724   2,338,237   4.49        100,513


* Unknown represents counts without address identifiers.
Source: Statistics Canada 1976 and 1981 Censuses; Alberta Health Care Insurance Plan data, prepared by Alberta
Bureau of Statistics, Alberta Treasury.


2) Mailing addresses are used rather than residential addresses, which has created difficulties
in assigning geographic locations. Discrepancies occur in areas where significant
rural populations surround an urban centre and the rural populace pick up their
mail in the urban centre. Consequently, most urban areas are overcounted while rural
areas are undercounted.


3) Incomplete and inaccurate data, especially related to postal codes, make it difficult 
to produce small-area statistics due to undercounting. 


4) Time lags in reporting and recording of the data influence counts. Generally speaking,
it takes three to six months to get an individual onto the system (birth, in-migrant),
but it usually takes much longer for an individual to be removed from the active system
(death, out-migrant). The lags, however, are difficult to track and differ substantially depending
on the circumstances.


b) Postal Codes: 


1) Postal codes define delivery service areas (where a person gets his mail), not necessari- 
ly a residence. This factor limits the accuracy of assigning AHCIP registrations to ap- 
propriate geographic areas. In particular, it creates urban-rural split problems, as 
discussed. 


2) A six-digit postal code, by itself, is not always enough to determine the service delivery 
area. A rural route, suburban service, or box number may be required to further specify 
a more exact location. 


3) Postal codes have been insufficient, especially in rural areas, to aggregate to appropriate 
levels. For example, there are approximately 363 census subdivisions in Alberta, but 
the Bureau’s PCTF can derive only 324 of these. 


The problems outlined above have precluded the release of AHCIP counts as approximations
of actual population. Although the counts were quite good in some areas, in others
they were poor or inconsistent. However, given the strong relationship between the health
care age and sex distributions and those of the Canada Census, as well as the consistency
of trends over time, the counts have been used in conjunction with the Bureau’s population
estimation methodology (as discussed in the next section). 


3. APPLICATION OF HEALTH CARE COUNTS TO 
SMALL-AREA POPULATION ESTIMATES 


The Bureau has produced intercensal population estimates for Alberta and provincial census 
divisions for nearly a decade. During this period, various methodologies and data sources 
have been examined and used to improve the quality of these estimates. To date, significant 
success has been achieved with the component method using health care counts as input data. 
These data have been used to derive the age and sex structure of the Alberta population at 
the provincial and census division level and to produce provincial and census division
population estimates. More recently, the data have also been tested for their applicability
in preparing census subdivision population estimates. 


3.1 Estimation Methodology 

The estimation methodology employed by the Bureau to produce subprovincial population
estimates comprises two parts. Part one presents the method of estimating migrant 
population. Part two outlines the method used to develop population estimates. 
a) Estimating Migrant Population Using Health Care Counts 


The Bureau developed data from three administrative files: counts from AHCIP records; 
births from data supplied by Alberta Vital Statistics; and deaths, also supplied by Alberta 
Vital Statistics. These sources were used to calculate net migration. For any small 
area, the growth in health care counts is obtained as the difference in counts between 
time t and time t-1. Subtracting the area’s natural increase (births minus deaths) from 
this growth gives the inflow (or outflow) of individuals, i.e., net migration. This 
procedure is expressed mathematically as: 


$$HMIG = (HC_t - HC_{t-1}) - (B - D)$$


Where: 
$HMIG$ = health care net migration counts between time $t-1$ and $t$ 
$HC_t$ = total health care counts at time $t$ 
$HC_{t-1}$ = total health care counts at time $t-1$ 
$B$ = total births during the time interval $t-1$ to $t$ 
$D$ = total deaths during the time interval $t-1$ to $t$. 
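
The residual calculation can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name and the counts in the example are hypothetical and are not taken from the Bureau's files.

```python
def health_care_net_migration(hc_t, hc_prev, births, deaths):
    """Residual net migration: HMIG = (HC_t - HC_{t-1}) - (B - D)."""
    growth = hc_t - hc_prev            # change in health care counts
    natural_increase = births - deaths
    return growth - natural_increase


# Hypothetical counts for one small area (illustrative numbers only).
hmig = health_care_net_migration(hc_t=12_500, hc_prev=12_000, births=180, deaths=90)
print(hmig)  # 410
```

A positive result indicates a net inflow and a negative result a net outflow.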


This health care migrant population estimate, however, is subject to the same over- and 
undercounting difficulties discussed in Section 2. As a result, although this approach would 
prepare estimates for small areas at the provincial level, these estimates would be less reliable 
than the provincial migration estimates currently derived using interprovincial flows to family 
allowance recipients. (The family allowance files are also used by Statistics Canada, which 
ensures provincial estimates generally are consistent with those produced at the federal level.) 

To further improve the small-area migration estimates and to ensure consistency with 
estimates at the provincial level, should the small areas be aggregated to a provincial total, 
an adjustment using a ratio distribution was incorporated. With this approach, the ratio 
of net migration from health care counts for an area to the net migration from health 
care counts for the province is multiplied by the provincial net migration calculated in
connection with the Bureau’s quarterly population estimates. Mathematically, the equation is: 


$$AMIG_i = \frac{HMIG_i}{HMIG_a} \times PMIG$$


Where: 
$AMIG_i$ = adjusted net migration in area $i$ 
$HMIG_i$ = health care net migration of counts for area $i$ 
$HMIG_a$ = health care net migration of counts for Alberta 
$PMIG$ = estimated provincial net migration from Alberta’s quarterly population estimates. 
This adjusted migration estimate (AMIG) is then used as input into estimating population. 
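
A minimal Python sketch of the ratio adjustment follows. The function name, the area labels and all figures are hypothetical; the sketch assumes the provincial health care net migration is supplied separately, since counts without address identifiers mean it need not equal the sum over areas.

```python
def adjust_net_migration(hmig_by_area, hmig_province, pmig):
    """Scale each area's health care net migration so that shares of the
    provincial health care total are applied to the provincial estimate PMIG."""
    if hmig_province == 0:
        raise ValueError("provincial health care net migration must be non-zero")
    return {
        area: hmig * pmig / hmig_province
        for area, hmig in hmig_by_area.items()
    }


# Hypothetical inputs: three areas whose health care net migration sums to 500.
hmig_by_area = {"Area A": 410, "Area B": -60, "Area C": 150}
amig = adjust_net_migration(hmig_by_area, hmig_province=500, pmig=450)
for area, value in amig.items():
    print(area, round(value, 1))
```

When the area values sum to the provincial health care total, as in this example, the adjusted values sum to PMIG.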


b) Estimation of Population For Small Areas 


The adjusted estimated net migration (AMIG) for each area is used in an equation using 
the components of population growth (births, deaths and migration): 


$$P_t = P_{t-1} + (B_i - D_i) + AMIG_i$$

Where: 
$P_t$ = estimated population in area $i$ at time $t$ 
$P_{t-1}$ = population in area $i$ at time $t-1$. 
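
As a worked check, the component equation can be applied to the Census Division 1 figures reported in Table 2 (1976 population 47,000, natural increase 2,730, net migration 6,080). The function name below is illustrative only, and the natural increase is passed as a single term because Table 2 does not list births and deaths separately.

```python
def estimate_population(p_prev, natural_increase, amig):
    """Component equation: P_t = P_{t-1} + (B - D) + AMIG."""
    return p_prev + natural_increase + amig


# Census Division 1, 1976-81, using the values shown in Table 2.
p_1981 = estimate_population(p_prev=47_000, natural_increase=2_730, amig=6_080)
print(p_1981)  # 55810
```

The result matches the 1981 Bureau estimate of 55,810 shown for that division in Table 2.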


3.2 Evaluation of Small-Area Estimates 


Using the above approach, the Bureau has developed population estimates for Alberta’s 
fifteen census divisions and twenty-eight municipalities with populations over 5000. The 
results, so far, have been promising. 

The results of a comparison between 1981 census data and estimates for 1981 prepared 
with 1976 census figures as a base population using the above described methodology, at 
the census division level, are presented in Table 2. For thirteen of the fifteen divisions the 
estimates were within ±2.0% of the 1981 census. Only two of the smallest 
CDs (9 and 14) showed a five-year deviation greater than 2.0%. The average absolute deviations
(i.e., average annual deviations) were no greater than 0.5% for all census divisions. 
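
The deviation measures can be illustrated with a short Python sketch. It assumes, as the values in Table 2 suggest, that the average absolute deviation is the absolute five-year percent difference divided by the five years between censuses; the function names are hypothetical.

```python
def percent_difference(estimate, census):
    """Percent difference of the estimate from the census count."""
    return (estimate - census) / census * 100


def average_absolute_deviation(estimate, census, years=5):
    """Absolute percent difference spread evenly over the intercensal period
    (an assumption inferred from the Table 2 values)."""
    return abs(percent_difference(estimate, census)) / years


# Census Division 9 in Table 2: estimate 21,090 against a 1981 census count of 21,630.
pct = percent_difference(21_090, 21_630)
avg = average_absolute_deviation(21_090, 21_630)
print(round(pct, 2), round(avg, 2))  # -2.5 0.5
```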

The twenty-eight population estimates for municipalities were compared to the 1981 census
counts, as well as available data from municipal censuses conducted from 1982 to 1984 
(Tables 3 and 4). Federal census comparisons showed nineteen estimates of the twenty-eight 
municipalities had an average absolute deviation of less than ±1.0%. Only six municipalities 
had annual differences greater than 2.0%. Comparisons with municipal censuses conducted 
between 1982 and 1984 yielded twenty-two instances of deviations within ±1.0%, fourteen 
ranging between ±1.0% and ±3.0%, while nine had deviations greater than ±3.0%. 

In general, the estimation results have been satisfactory and encouraging. The develop- 
ment of AHCIP registrant counts and the component approach employed to estimate popula- 
tion have improved the accuracy of the population estimates produced and opened up 
possibilities for deriving estimates for user-defined small geographic areas. The Bureau will 
continue to investigate ways to improve the AHCIP counts (some of which are related to 
new administrative procedures being incorporated for the AHCIP). Also, the population 
estimation methodology will be further refined as new data techniques become available. 


3.3 Summary of Advantages and Disadvantages of Using AHCIP 


Using health care counts in deriving small-area population estimates has a number of ad- 
vantages and disadvantages. 
Table 2 


Comparisons of Canada Census Counts and Alberta Bureau of Statistics 
Population Estimates for Alberta Census Divisions 


                                  Bureau Estimatesᵃ 

Census          1976      Natural          Net                    1981
Division  Population    Increaseᵇ    Migration      Growth  Population
                          1976-81      1976-81

1             47,000        2,730        6,080       8,810      55,810
2             96,980        6,120        7,190      13,310     110,290
3             32,870        2,310          100       2,410      35,280
4             12,140          490         -520         -30      12,110
5             35,460        1,820          790       2,610      38,070
6            524,570       33,860      107,540     141,400     665,970
7             37,820        2,010          -10       2,000      39,820
8             95,400        6,140       20,860      27,000     122,400
9             19,850        1,040          200       1,240      21,090
10            67,230        1,650        8,550      10,200      77,430
11           632,830       43,880       90,880     134,760     767,590
12            63,130        6,470       16,130      22,600      85,730
13            46,300        2,040        4,320       6,360      52,660
14            19,450        2,200        2,430       4,630      24,080
15           107,010       10,260       10,040      20,300     127,310
Alberta    1,838,040      123,020      274,580     397,600   2,235,640


                               Difference           Average
Census          1981                               Absolute
Division      Census      Number     Percent      Deviation

1             55,360         450        0.81           0.16
2            110,470        -180       -0.16           0.03
3             35,640        -360       -1.01           0.20
4             12,120         -10       -0.08           0.02
5             38,430        -360       -0.94           0.19
6            668,680      -2,710       -0.41           0.08
7             40,030        -210       -0.52           0.10
8            123,690      -1,290       -1.04           0.21
9             21,630        -540       -2.50           0.50
10            78,390        -960       -1.22           0.24
11           762,080       5,510        0.72           0.14
12            84,220       1,510        1.79           0.36
13            53,690      -1,030       -1.92           0.38
14            24,650        -570       -2.31           0.46
15           128,640      -1,330       -1.03           0.21
Alberta    2,237,720      -2,080       -0.09           0.02


ᵃ Data are experimental. 

ᵇ Natural increase refers to the number of births minus the number of deaths. 

Note: Components may not add to total due to rounding. 

Source: Statistics Canada 1976 and 1981 Censuses; Alberta Bureau of Statistics Estimates. 
Table 3 

Comparisons of the Canada Census Counts and Alberta Bureau of Statistics 
Population Estimates for Selected Alberta Municipalities 

                                     Bureau Estimatesᵃ 

                         1976     Natural         Net        1981        1981   Percent    Average
Municipality       Population   Increaseᵇ   Migration    Estimate      Census   Differ-   Absolute
                                1976-1981   1976-1981                             ence   Deviation

Airdrie                 1,410         580       5,090       7,070       8,410     -15.9        3.2
Brooks                  6,340         730         200       9,440       9,420       0.2        0.0
Calgary               469,920      30,310      93,760     593,990     592,740       0.2        0.0
Camrose                10,100         150       2,570      12,830      12,570       2.1        0.4
Crowsnest Pass          5,250          40        -410       4,880       7,310     -33.2        6.6
Drayton Valley          4,300         530       1,760       6,590       5,040      30.8        6.2
Drumheller              6,150          20         220       6,390       6,510      -1.8        0.4
Edmonton              461,360      27,900      51,240     540,510     532,250       1.6        0.3
Edson                   4,040         510       2,490       7,040       5,840      20.5        4.1
Fort McMurray          15,420       2,900      14,140      32,460      31,000       4.7        0.9
Fort Saskatchewan       8,300         800       2,660      11,760      12,170      -3.4        0.7
Grande Prairie         17,630       1,970       6,300      25,900      24,260       6.8        1.4
Hinton                  6,730         760        -820       6,670       8,340     -20.0        4.0
Innisfail               2,900         230       1,930       5,060       5,250      -3.6        0.7
Lacombe                 3,890         150       1,210       5,240       5,590      -6.3        1.3
Leduc                   8,580         920       3,430      12,930      12,470       3.7        0.7
Lethbridge             46,750       2,070       4,400      53,220      54,070      -1.6        0.3
Medicine Hat           32,810       1,770       6,010      40,590      40,380       0.5        0.1
Peace River             4,840         580         970       6,390       5,910       8.1        1.6
Ponoka                  4,640         -10         530       5,160       5,220      -1.1        0.2
Red Deer               32,180       2,300      11,790      46,270      46,390      -0.3        0.1
Spruce Grove            6,910       1,110       4,710      12,730      10,330      23.2        4.6
St. Albert             24,130       2,360       6,670      33,160      32,000       3.6        0.7
Stettler                4,180         500         580       5,270       5,140       2.5        0.5
Taber                   5,300         320         410       6,020       5,990       0.5        0.1
Vegreville              4,160          80         860       5,090       5,250      -3.0        0.6
Wetaskiwin              6,750         300       2,440       9,490       9,600      -1.1        0.2
Whitecourt              3,880         600       1,150       5,630       5,590       0.7        0.1
Alberta             1,838,040     123,020     274,580   2,235,630   2,237,720      -0.1        0.0


ᵃ Data are experimental. 

ᵇ Natural increase refers to the number of births minus the number of deaths. 

Note: Components may not add to total due to rounding. 

Source: Statistics Canada 1976 and 1981 Censuses; Alberta Bureau of Statistics Estimates. 
Table 4 


Comparisons of Alberta Municipal Censuses and Alberta Bureau of Statistics 
Population Estimates for Selected Municipalities 


                              1982                         1983                         1984 

                      Esti-            Devia-      Esti-            Devia-      Esti-            Devia-
Municipality          mateᵃ   Census   tion %      mateᵃ   Census   tion %      mateᵃ   Census   tion %

Airdrie               9,450    9,980     -5.3      9,830   10,430     -5.8     10,080       --       --
Brooks                9,640       --       --      9,790       --       --      9,510       --       --
Calgary             614,930  623,130     -1.3    622,510  620,690      0.3    615,140  619,810     -0.8
Camrose              12,880   12,810      0.6     12,970       --       --     13,070   12,750      2.5
Crowsnest Pass        7,490    7,580     -1.1      7,530       --       --      7,350       --       --
Drayton Valley        5,120    4,870      5.1      5,200       --       --      5,310    4,920      7.9
Drumheller            6,660       --       --      6,700    6,670      0.4      6,620       --       --
Edmontonᵇ           550,930  551,310     -0.1    557,400  560,090     -0.5    551,140       --       --
Edsonᵇ                6,110    6,290     -2.9      6,220       --       --      6,080    7,110    -14.5
Fort McMurray        32,930   33,580     -1.9     33,600   34,490     -2.6     35,150   35,350     -0.6
Fort Saskatchewan    12,530   12,460      0.6     12,650   12,470      1.4     12,620       --       --
Grande Prairie       24,650       --       --     24,910   24,080      3.5     25,370   24,410      3.9
Hinton                8,820    8,820      0.0      8,980    8,830      1.8      8,950    8,900      0.6
Innisfail             5,420    5,440     -0.4      5,460       --       --      5,440    5,440      0.0
Lacombe 555100 53720 PE pS 50 sc o0 ad 950 =e 8s 9,050 -- 
Leduc                12,880       --       --     13,010       --       --     13,290       --       --
Lethbridgeᵇ          55,440   56,500     -1.9     55,900   58,090     -3.8     57,500       --       --
Medicine Hatᵇ        41,070       --       --     41,440   42,270      0.7     41,540       --       --
Peace River           6,080       --       --      6,150       --       --      6,250       --       --
Ponoka                5,310       --       --      5,310       --       --      5,280       --       --
Red Deer             48,450   48,560     -0.2     49,230   50,260     -2.0     50,860   51,070     -0.4
Spruce Grove         11,080   10,780      2.7     11,410   11,310      0.9     11,550   11,570     -0.1
St. Albert           33,170   32,980      0.6     33,740   35,030     -3.7     34,840   35,530     -1.9
Stettler              5,180       --       --      5,220       --       --      5,300       --       --
Taber                 6,140       --       --      6,210       --       --      6,360    6,380     -0.4
Vegreville            5,280    5,250      0.6      5,290       --       --      5,390       --       --
Wetaskiwin            9,880    9,900     -0.2      9,990   10,020     -0.3     10,080       --       --
Whitecourt            5,710       --       --      5,840       --       --      5,710       --       --


ᵃ Data are experimental. 
ᵇ Annexation took place between 1982 and 1984. 
Note: ‘‘--’’ indicates that a municipal census is not available. 
Source: Alberta Municipal Affairs, 1982-1984 Municipal Censuses; Alberta Bureau of Statistics Estimates. 
Advantages: 

a) AHCIP registration data provide universal coverage of all individuals in Alberta; 

b) Registration lag appears to be random and does not adversely affect distributions or 
trends of the counts; 

c) Data are available on a timely/frequent basis; and 

d) The file contains some socio-economic information on registrants and dependents (e.g., 
age, sex and marital status) to enable the production of more than basic population 
estimates. 


Disadvantages: 
a) Residency based on postal codes can lead to some inaccuracies; 


b) AHCIP registrants can leave the system, for example through death or out-migration, without 
notifying AHMC, resulting in overcounts; and 


c) Administrative procedures may cause discrepancies/inaccuracies in the number of Alberta
Health Care registrants. 


4. CONCLUSION 


Our experience with the development of health care counts has been very positive. The greatest
potential lies in the use of the counts in a component model to produce estimates for small
areas, along with their excellent age-sex distribution ratios and trend consistency. Costs of development 
of the demographic reporting systems were not considered excessive in light of these benefits. 
For other provincial agencies contemplating the development of provincial health care files, 
the Bureau would certainly be willing to discuss its experiences in more detail and make 
available additional information, such as record layouts and system processing costs. 
The Use of Hydro Accounts in the British Columbia 
Regression Based Population Estimation Model¹ 


DONALD G. McRAE² 


ABSTRACT 


The accuracy of small area population estimates derived from a regression based model is heavily depen- 
dent on the ability of the indicator data selected to accurately reflect population change. Hence, prior 
knowledge as to the characteristics of the administrative data used as potential population indicators 
in a regression model is important. This report summarizes the strengths and weaknesses associated 
with the use of residential hydro accounts in the British Columbia regression based population estima- 
tion model. 


KEY WORDS: Small area population estimates; Regression method; Difference-correlation method; 
Population indicators; Hydro accounts; Family allowance recipients. 


1. INTRODUCTION 


The Central Statistics Bureau produces post-censal population estimates for a variety of 
geographic units within the Province of British Columbia including municipalities, local health 
areas, census divisions and RCMP regions among others. Current population estimates are 
produced for these sub-provincial areas by means of a regression approach, specifically the 
Difference-Correlation Method (DCM). 

A detailed description of this methodology is given in earlier papers (Central Statistics 
Bureau 1982, McRae 1985). The data used as indicators of population are the number of 
family allowance recipients (F), and/or the number of residential hydro accounts (H). The 
characteristics of this second data source, residential hydro accounts, relative to the British 
Columbia model will be examined over the remainder of this paper. 


2. DATA SOURCES 


Residential hydro accounts data within British Columbia are obtained from nine different 
organizations. These are: 


                                          % of Total Hydro 
Organization                              Accounts (1985) 

(1) British Columbia Hydro                     90.9 
(2) West Kootenay Power and Light               4.7 
(3) Princeton Light and Power Co.               0.2 
(4) City of Kelowna                             0.8 
(5) City of Penticton                           0.8 
(6) District of Summerland                      0.3 
(7) City of New Westminster                     1.7 
(8) City of Grand Forks                         0.1 
(9) City of Nelson                              0.6 


¹ Presented at the meeting of the Federal-Provincial Committee on Demography, Ottawa, November 28-29, 1985. 
² D.G. McRae, Central Statistics Bureau, Ministry of Industry and Small Business Development, Government of 
British Columbia, 2nd Floor, 1405 Douglas Street, Victoria, British Columbia, Canada V8W 3C1. 

The views expressed in this paper are those of the author and do not necessarily represent the views of the Government 
of British Columbia. 
The major suppliers of residential electrical power are British Columbia Hydro and West 
Kootenay Power and Light. The other organizations purchase power from the two major 
suppliers, and retail this electricity to their own customers (usually the residents of the 
municipality). 


3. DATA FORMAT 


Of the nine sources of residential hydro accounts data, only that provided by British Col- 
umbia Hydro is in machine readable form. The other eight organizations provide the data 
totalled by municipality (urban), along with a total of any rural (non-municipal) customers. 
The reference date for all data is the May 31 billing file, and in most cases the data can be 
obtained within 2 to 3 weeks of the billing date. 

Data provided by British Columbia Hydro is in two formats. The first shows the number 
of residential meters as of May 31 by Capital District Code. A Capital District Code, of 
which there are approximately 248 in the Province, is an administrative unit used by British 
Columbia Hydro and corresponds to a municipality where municipalities exist. By agree- 
ment, both the major power suppliers in the Province pay each municipality a certain percen- 
tage of the annual revenue collected from the residential customers in that municipality in 
lieu of property taxes. As a result, power companies such as British Columbia Hydro design 
their accounting systems to correspond to customers within municipal boundaries. In addition,
British Columbia Hydro attempts to maintain a close correspondence between Capital 
District boundaries and school district boundaries. 

The second format provides for each of the one million plus residential meters the postal 
code of billing address. This second data file allows the easy translation of hydro meters 
to geographic units other than municipalities and school districts via the postal code. 


4. STRENGTHS OF THE HYDRO DATA IN A REGRESSION MODEL 


Empirical tests of the two different data sources, hydro meters (H) and family allowance 
accounts (F), were conducted by producing 1981 population estimates with each separately 
and together. The regression coefficients used were derived from the pooled 1971/76 and 
1976/81 periods, and the base year was 1976. The results were compared with the 1981 Cen- 
sus, and the Average Absolute Percent Errors (AAPE) were calculated. The results are given 
in Tables 1 and 2. 

As can be seen in Table 1, population estimates based on hydro data produce, on average, 
lower percentage errors than the family allowance based estimates. Closer examination of 
Table 1 reveals that the improvement in estimation accuracy lies almost entirely with the 
estimates for areas with population less than 4000. This observation is reinforced in Table 
2 where it is shown that, statistically speaking, there is a significant difference in the estima- 
tion accuracy between the hydro and family allowance based estimates for areas less than 
4000 population. 

The marginal effect of adding another population indicator to the Difference-Correlation 
Method can be judged by examining the change in estimation accuracy with and without 
the additional indicator. It would appear from Tables 1 and 2 that the inclusion of hydro 
data statistically improves the estimation accuracy in both large and small areas. Family 
allowance data, on the other hand, improves the accuracy for larger areas but reduces the 
estimation accuracy for smaller areas, with no statistically significant effect overall. 
Table 1 


Comparison of Estimation Errors Among Data Sources 
for British Columbia Municipalities - 1981 


                                            Population 
                   Overall            ≥ 4000            < 4000 
Data Source      AAPE      n       AAPE      n       AAPE      n 

DCM/H/F          5.53    158       2.99     88       8.72     70 
DCM/H            5.16    158       4.04     88       6.58     70 
DCM/F           10.46    167       4.57     92      17.69     75 

$$\mathrm{AAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{Y}_i - Y_i|}{Y_i} \times 100$$

where: 

$Y_i$ = census population for region $i$ 

$\hat{Y}_i$ = estimated population for region $i$ 

$n$ = number of areas estimated. 
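
A minimal Python sketch of the AAPE calculation is given below. The function name and the three pairs of census and estimated populations are hypothetical and serve only to show the arithmetic.

```python
def average_absolute_percent_error(census, estimates):
    """AAPE = (1/n) * sum(|estimate_i - census_i| / census_i) * 100."""
    if len(census) != len(estimates) or not census:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(est - cen) / cen for cen, est in zip(census, estimates))
    return total / len(census) * 100


# Hypothetical census counts and estimates for three areas.
census = [1_000, 2_000, 4_000]
estimates = [1_100, 1_900, 4_000]
aape = average_absolute_percent_error(census, estimates)
print(round(aape, 2))  # 5.0
```

The individual errors here are 10%, 5% and 0%, so the average is 5%.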

Table 2 

Test for Statistically Significant Differences Between the Average 
Absolute Percent Errors for Selected Data Sources - 1981 

                         95% Confidence Interval for the Average 
                         Difference in Absolute Percent Errors 

                                                  Population 
Data Source              Overall            ≥ 4000            < 4000 

DCM/H/F - DCM/H        0.37 ± 0.86     -1.05 ± 0.56ᵃ     2s 76" 
DCM/H/F - DCM/Fᵇ      -4.86 ± 1.55ᵃ    -1.57 ± 0.85ᵃ    -9.00 ± 3.11ᵃ 


ᵃ Statistically significant differences at the 5% level utilizing a two tailed T-test, paired samples and 
assuming normally distributed means. 

ᵇ In order to pair the samples only 158 of the possible 167 family allowance based estimates were used. 
The numbers of observations were: overall, 158; greater than or equal to 4000, 88; less than 4000, 70. 
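
The confidence intervals in Table 2 are of the paired-sample form: mean difference plus or minus a critical t value times the standard error of the differences. The sketch below shows that calculation in Python under stated assumptions: the function name is hypothetical, the five pairs of absolute percent errors are invented for illustration, and the critical value 2.776 is the standard two-tailed 5% value for 4 degrees of freedom, supplied by the caller rather than computed.

```python
import math
import statistics


def paired_difference_interval(errors_a, errors_b, t_critical):
    """Mean paired difference and the half-width of its confidence interval."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, t_critical * std_error


# Hypothetical absolute percent errors for the same five areas under two models.
errors_model_1 = [3.0, 5.0, 2.0, 8.0, 6.0]
errors_model_2 = [4.0, 7.0, 2.5, 11.0, 9.0]
mean_diff, half_width = paired_difference_interval(
    errors_model_1, errors_model_2, t_critical=2.776
)
significant = abs(mean_diff) > half_width  # interval excludes zero
print(round(mean_diff, 2), round(half_width, 2), significant)
```

A difference is flagged as significant when the interval excludes zero, which is how the marked entries in Table 2 can be read.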


5. WEAKNESSES OF HYDRO DATA IN A REGRESSION MODEL 


One problem encountered when using hydro data in a regression model for population 
estimates is that of vacant dwellings, or more accurately, significant differences in the rate 
of vacant dwellings between the base and estimating years. This weakness of the data was 
demonstrated by the 1981 evaluation of the communities in the Peace River-Liard region 
of British Columbia (McRae 1982). As a result of the North-East Coal project, the communities
of the Peace River-Liard Census Division in 1981 were experiencing a building boom 
as developers constructed dwellings in anticipation of a population influx. Each dwelling, 
occupied or not (or even under construction), would require a meter, which may have had 
low usage, but was still active and hence counted. As a result, the change in share of meters 
from 1976 to 1981 was overstated relative to population, producing overestimates of the 1981 
population for many of the Peace River-Liard communities. 
Another weakness of the hydro data is the potential for a change-over of multiple dwelling
units from single to multiple meters. This may occur when an older apartment building, 
for example, serviced by a single meter is remodelled or replaced with individually metered 
units. This problem would produce an overestimate of population in a regression model if 
it were to occur sometime between the base and estimating years. 

Finally, some problems will result if the hydro data is used for areas that have a changing 
nature, or in other words, a changing relationship between population and residential meters. 
One example of this in B.C. is the resort community of Whistler. Fifteen years ago this 
municipality was largely a collection of winter cabins on a ski hill. However, over the last 
decade this area has been shifting to a year round residence basis. Consequently, the number 
of persons per hydro meter, which was originally very low relative to the B.C. average, is 
moving toward the norm. Like the vacancy problem, the use of hydro data to estimate the 
population for such a community would likely result in above average errors. 

The solution to all three of the problems mentioned above is to remove accounts that 
have a low monthly or bi-monthly usage, and hence are assumed to be vacant. The feasibili- 
ty of this procedure is currently being examined by the Central Statistics Bureau in relation 
to the data obtained from B.C. Hydro. If possible, we hope to have the improved data set 
available for calibration against the 1986 Census. Currently, as a partial solution, hydro data 
is not used for areas that had a low or high ratio of persons per meter in 1981 relative to the 
provincial norm (i.e. less than 2 or greater than 5). 
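
The partial solution amounts to a simple screening rule, sketched below in Python. The function name, the area names and the population and meter counts are all hypothetical; only the thresholds of 2 and 5 persons per meter come from the text.

```python
def usable_for_hydro_indicator(population_1981, meters_1981, low=2.0, high=5.0):
    """Keep an area only if its 1981 persons-per-meter ratio lies in [low, high]."""
    ratio = population_1981 / meters_1981
    return low <= ratio <= high


# Hypothetical areas: (1981 population, 1981 residential meters).
areas = {
    "Area A": (12_000, 4_000),    # 3.0 persons per meter
    "Resort B": (900, 1_200),     # 0.75, many seasonal or vacant dwellings
    "Area C": (5_500, 1_000),     # 5.5, e.g. several units behind one meter
}
kept = [name for name, (pop, meters) in areas.items()
        if usable_for_hydro_indicator(pop, meters)]
print(kept)  # ['Area A']
```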

A final potential weakness of the hydro data is the reliance on external and different 
organizations for the information. In the past, this situation generally has not proven to be 
a problem. However, there have been some rare cases that have called into question the quality 
of the meter data collected in the field. Such a case may be a boundary change of a municipality 
not being reflected in the meter data, or the addition of some types of non-residential ac- 
counts (such as lamp standards) to the data. As a result, careful monitoring of the data is 
important. 


6. CONCLUSIONS 


The following strengths and weaknesses are associated with the use of hydro data in the 
British Columbia regression based population estimation model. 


Strengths: 

(a) The hydro data, when used in a regression model, produces a lower average absolute 
percent error than family allowance data for small areas. 

(b) The data is obtained from each supplier in a format that is already aggregated to 
municipalities. The major advantage of this is that changes in municipal boundaries, which 
occur regularly, are reflected in the data with no additional work on the part of the Bureau. 

(c) The majority of the data can be obtained in machine readable form along with the postal 
code. This allows the easy translation of the data to geographic regions other than 
municipalities when sorted by the Bureau’s postal code Translation Master File. 

(d) The data can be obtained free of charge from each of the suppliers within a relatively 
short time period (2 to 3 weeks). 


Weaknesses: 


(a) Differential vacancy rates between the base and estimating years will bias the estimates. 


(b) Dwelling units (such as apartments) that change from a single to multiple meter sometime 
between the base and estimating years will bias the estimates upwards. 
(c) Areas with a changing nature, such as from a seasonal to ‘‘stable’’ population, will
introduce bias into the estimates. 

(d) The data is obtained from external and different organizations. This potentially could 
cause problems in terms of data quality and comparability, as well as producing a situa- 
tion in which the priorities of the Bureau’s population estimates program are subservient 
to the administrative needs of an external organization. 
Estimating the Age/Sex Distribution 
of Small Area Populations¹ 


DAVID S. O’NEIL and CHRIS D. McINTOSH² 


ABSTRACT 


This paper describes a method of producing current age/sex specific population estimates for small 
areas utilizing as inputs total population estimates, birth and death data and estimates of historical 
residual net migration. An evaluation based on the 1981 Census counts for census divisions and school 
districts in British Columbia is presented. 


KEY WORDS: Age/sex population estimates; Small area; Residual net migration. 


1. INTRODUCTION 


The Central Statistics Bureau currently produces post-censal population estimates for a 
variety of sub-provincial areas using a regression approach (Central Statistics Bureau 1982). 
In addition to estimates of the total population by small area, age/sex specific estimates are 
also produced. 


This paper outlines the method by which age/sex specific population estimates are deriv- 
ed for subprovincial areas of British Columbia, given an estimate of the total population. 


2. OVERVIEW 


The methodology used to derive the small area populations by sex and single years of 
age is divided into two parts. 

The first part consists of examining historical residual net migration data compiled from 
censuses to derive a number of migration distributions by sex and single year of age for each 
small area (Shryock and Siegel 1980). 

The second part of the methodology consists of aging the base population for each sex 
and adding births and subtracting deaths to yield a new population distribution for each 
area. This is referred to as the ‘‘natural base’’ population. The difference between the 
estimated total population by sex and the natural base population yields a residual term, 
which is equal to net migration by sex if the population and vital events for the two periods 
are exact. This small area sex specific residual term is distributed by single years of age
according to a historical distribution, then added to the natural base population giving an age/sex 
specific population estimate for the area in the next time period. 
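
The second part of the methodology can be sketched as follows for one sex in one small area. This is an illustration under simplifying assumptions: three age groups stand in for single years of age, the historical distribution is expressed as shares that sum to one, and the function name and all numbers are hypothetical.

```python
def age_specific_estimate(natural_base, total_estimate, migration_shares):
    """Distribute the residual (total estimate minus natural base total)
    across ages by historical shares and add it to the natural base."""
    if len(natural_base) != len(migration_shares):
        raise ValueError("one migration share is required per age group")
    if abs(sum(migration_shares) - 1.0) > 1e-9:
        raise ValueError("migration shares must sum to 1")
    residual = total_estimate - sum(natural_base)  # net migration for this sex
    return [base + residual * share
            for base, share in zip(natural_base, migration_shares)]


# Hypothetical natural base population (aged, plus births, minus deaths).
natural_base = [100, 200, 150]
# Hypothetical historical net migration shares by age group.
shares = [0.2, 0.5, 0.3]
by_age = age_specific_estimate(natural_base, total_estimate=500, migration_shares=shares)
print([round(value, 1) for value in by_age])
```

By construction the age-specific values sum to the total population estimate for that sex.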

Due to the timeliness of the input data, estimates of the total populations can be produced
four months after the reference date of June 1, and the age/sex breakdowns one to two 
months later. 


¹ Abridged version of the paper presented at the meeting of the Federal-Provincial Committee on Demography, 
Ottawa, November 28-29, 1985. 

² D.S. O’Neil, SRL Sociometrics Resources Ltd., and C.D. McIntosh, Intersoft Resources Ltd., Central Statistics 
Bureau, Ministry of Industry and Small Business Development, Government of British Columbia, 2nd Floor, 1405 
Douglas Street, Victoria, British Columbia, Canada V8W 3C1. 

The views expressed in this paper are those of the authors and do not necessarily represent the views of the Govern- 
ment of British Columbia. 
3. HISTORICAL NET MIGRATION DISTRIBUTIONS 


Age/sex specific residual estimates of net migration were compiled for the census periods 
1961/66, 1966/71 and 1971/76 for each of the 74 British Columbia school districts. These 
are referred to as the Historical Small Area Distributions. 

Examination of these net migration distributions by small area showed them to be ex- 
tremely unstable over time. In order to minimize the effects of this instability, a number 
of steps were taken. 

First, migration distributions by small area were separated according to whether they oc- 
curred during a time of positive or negative total net migration. It was found that residual 
migration age distributions for many areas differed depending on whether net migration was 
positive or negative. 

A further step taken to reduce the effects of unstable migration distributions was to group 
small areas of similar proportional migration distributions together, then calculate the positive 
and negative net migration distributions for each group of areas. These were called the 
Historical Grouped Distributions. Cluster analysis (using the SPSS/PC procedure) across 
selected age groups was used to group the historical small area migration distributions. Ex- 
amination of cluster memberships from different periods resulted in the placing of the ma- 
jority of areas into three clusters, while eight areas were maintained as unique independent 
clusters. Once areas had been arranged into groups, positive and negative migration distribu- 
tions were calculated from the most recent periods of positive or negative net migration. 


4. SMALL AREA POPULATION ESTIMATES BY SEX 
AND SINGLE YEAR OF AGE 


As noted in Section 3, some areas showed considerable time-series variation in the residually 
calculated net migration distributions. This was likely the result of two factors. First, many
of the areas under study possess small resource based economies subject to wide fluctua- 
tions, with consequent swings in migration levels. Second, a certain amount of instability 
is introduced when calculating a percentile distribution for a concept such as net migration, 
which may have either positive, negative, or zero values. 

In order to guard against adopting a historical net migration distribution that may not 
be a representative distribution for the estimating year, five different historical sex-specific 
distributions were calculated, then distributed by single year of age. A description of these 
five different net migration distributions is given below. 


1) The Historical Small Area Distribution for each small area having the same sign as 
the net migration to that small area was the first migration distribution. 

2) The Historical Grouped Distribution for the group the small area belongs to, having the
same sign as the net migration to that small area, was the second migration distribution.

3) The third migration distribution was calculated by separately totaling the migration 
from the most recent time period for all small areas with a positive and negative net 
migration, then calculating the age distributions. 

4) The fourth distribution was the distribution of the natural base population for each 
small area. 

5) The fifth and final distribution was the age distribution of migrants to British Colum- 
bia as a whole. For all the years under consideration, migration to B.C. has been 
positive, hence this is a positive distribution. Nevertheless, it was used as the fifth 
distribution regardless of whether the migration to a small area was positive or negative.

In some cases it was not possible to calculate all five distributions. This was the case if
a small area never had a negative net migration in the past, but one is indicated for the 
estimating year under consideration. In situations such as this only distributions that can 
be calculated were used to distribute the small area net migration. 

Empirical testing based on the 1981 Census indicated that of the five net migration distribu- 
tions described above, number 1 (the Historical Small Area Distribution) produced the lowest 
average absolute percent error over all school districts and age groups, followed by number 
2 (Historical Grouped Distribution), then number 3, etc. However, despite the fact that 
distribution number 1 produced the lowest error on average, it did not produce the lowest 
error in each case. Hence, a selection procedure was designed to substitute the population 
distribution produced by number 1, with either 2, 3, 4, or 5 in only those cases where the 
population distribution produced by number 1 was considered unrepresentative of the 
estimating year population distribution. 

Empirical testing based on the 1981 Census led to the adoption of the following selection
procedure.

First, all migration distributions possible were calculated and added to the natural base 
population, resulting in up to five possibilities for the small area estimated population by 
sex and single year of age in the next time period. These age/sex specific population estimates 
were then examined to determine which one produced the least change in the small area age 
structure from the previous year. This was done by first calculating the unweighted average
percent difference between the age structures for each of the five possible populations in
time t+1 and the population in time t. Next, the standard deviations about these averages
were calculated, and the distribution with the lowest standard deviation was flagged. If the
standard deviation produced by using the Historical Small Area Distribution was significantly 
greater than the smallest standard deviation (i.e. of the flagged distribution), then the 
Historical Small Area Distribution was rejected. This procedure was repeated with the 
Historical Grouped Distribution, and so on until one of the five possible populations was 
selected. 
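
The selection rule described above can be sketched in code. This is only an illustrative reading of the procedure, not the authors' implementation: the paper does not say how "significantly greater" was judged, so the ratio threshold `tolerance` is a hypothetical stand-in, and the function names are invented for the example. Age structure is taken here to mean each age's share of the area total, and all shares are assumed to be positive.

```python
from statistics import pstdev


def age_structure(pop):
    """Convert population counts by age into shares of the area total."""
    total = sum(pop)
    return [p / total for p in pop]


def structure_change_sd(pop_t, pop_t1):
    """Standard deviation, about their unweighted average, of the percent
    differences between the age structures at time t and time t+1."""
    s0, s1 = age_structure(pop_t), age_structure(pop_t1)
    pct_diff = [100.0 * (b - a) / a for a, b in zip(s0, s1)]
    return pstdev(pct_diff)


def select_population(pop_t, candidates, tolerance=1.5):
    """Pick one candidate population for time t+1.

    `candidates` is a list of (name, population) pairs in order of
    preference (distribution 1 first). Distributions that could not be
    calculated are simply left out of the list. `tolerance` is an assumed
    stand-in for the paper's unspecified significance criterion.
    """
    sds = [(name, structure_change_sd(pop_t, pop)) for name, pop in candidates]
    smallest = min(sd for _, sd in sds)  # the "flagged" distribution
    for name, sd in sds:
        # Accept the first candidate not markedly worse than the flagged one.
        if sd <= tolerance * smallest:
            return name
    return sds[-1][0]  # not reached: the flagged candidate always passes
```

Because the flagged distribution always satisfies the test, the loop returns the most preferred candidate whose standard deviation is close enough to the smallest one.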

Once the “best” population in time t+1 was calculated for all small areas, two final
adjustments were made. First, family allowance data was substituted for the age groups 0-14,
and the populations for the rest of the age groups were pro-rated to keep the total popula- 
tion of each small area constant. The second adjustment was to pro-rate the population to 
ensure the age distribution of the sum of the small area population estimates was consistent 
with the British Columbia age distribution estimated by Statistics Canada. 
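
The first of these adjustments can be illustrated with a short sketch. It assumes populations are held in a dictionary keyed by age group; the function name and the data layout are hypothetical and chosen only for the example. The second adjustment (consistency with the provincial age distribution) is not shown.

```python
def substitute_and_prorate(pop_by_group, fa_counts):
    """Replace selected age groups with family allowance counts and scale
    the remaining groups so that the small area total is unchanged.

    pop_by_group: estimated population by age group (hypothetical layout).
    fa_counts: family allowance counts for the age groups to be replaced.
    """
    total = sum(pop_by_group.values())
    adjusted = dict(pop_by_group)
    adjusted.update(fa_counts)

    # Population left over for the groups that were not replaced.
    remaining_target = total - sum(fa_counts.values())
    other_groups = [g for g in pop_by_group if g not in fa_counts]
    other_sum = sum(pop_by_group[g] for g in other_groups)

    # Pro-rate the remaining groups by a single common factor.
    factor = remaining_target / other_sum
    for g in other_groups:
        adjusted[g] = pop_by_group[g] * factor
    return adjusted
```

For example, with an estimated 200 persons aged 0-14 out of 1,000 and a family allowance count of 220, the other age groups are scaled by 780/800 so that the area total stays at 1,000.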


5. EVALUATION OF THE CURRENT METHODOLOGY 


The following tables summarize the error associated with the June 1, 1981 population 
estimates by five year age group to 70+, for 74 British Columbia school districts and 29 
census divisions. The census division age/sex specific population estimates were derived by
aggregating school district population estimates.

The accuracy of the small area age/sex specific population estimates derived from the
previously described methodology was evaluated by producing 1981 population estimates
by sex and 5 year age groups to 70+ for 74 school districts, then comparing these results
to the 1981 Census. Two summary measures were used to evaluate the effectiveness of the
age/sex specific population estimates. These were Average Absolute Percent Error (AAPE)
and Index of Misallocation (IM). The AAPE is defined as: 


$$\mathrm{AAPE} = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{P_{ei} - P_{ai}}{P_{ai}} \right|$$

where $P_{ei}$ is the estimated cell population for age group i, $P_{ai}$ is the Census cell population for
age group i, and N is the number of cells. The IM is defined as:


$$\mathrm{IM} = 100 \times \frac{1}{2} \, \frac{\sum_{i=1}^{N} \left| P_{ai} - P_{ei} \right|}{\sum_{i=1}^{N} P_{ai}}$$


where $P_{ai}$ is the actual cell population for age group i, and $P_{ei}$ is the estimated cell popula-
tion for age group i.
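
As an illustration, the two measures can be computed as follows. This is a minimal sketch in which the estimated and Census cell populations are assumed to be supplied as equal-length lists; the function names are chosen for the example.

```python
def aape(estimated, actual):
    """Average Absolute Percent Error: mean over cells of the absolute
    relative error, expressed as a percentage."""
    n = len(actual)
    return 100.0 * sum(abs((e - a) / a) for e, a in zip(estimated, actual)) / n


def index_of_misallocation(estimated, actual):
    """Index of Misallocation: half the sum of absolute cell errors,
    as a percentage of the total actual population."""
    abs_errors = sum(abs(a - e) for e, a in zip(estimated, actual))
    return 100.0 * 0.5 * abs_errors / sum(actual)
```

With estimated cells of 110, 190 and 300 against Census cells of 100, 200 and 300, the AAPE is 5.0% while the IM is about 1.67%. The IM is smaller because it weights each cell by its size, whereas the AAPE treats every cell equally.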


As seen in Table 1, relative to the 1981 Census the average absolute percent error over all 
age groups and regions is 6.20%, and the IM is 1.95%. The average percent errors for male 
and female are quite similar (AAPE’s of 7.00% for both, and IM’s of 2.15% for males and 
2.08% for females). 

By age, the highest errors occur in the 20-29 and 60-69 age groups. It should also be noted 
that there is some difference in the age distribution of errors between males and females. 
Males appear to have higher error in the upper age groups, while females have higher error 
in the very mobile 20-29 age groups. 


Table 1

Error by Age Group Across School District
1981 Estimated Versus Census
Average Absolute Percent Error (AAPE) and Index of Misallocation (IM)

                 Total              Male              Female
Age          AAPE     IM       AAPE     IM       AAPE     IM
              (%)    (%)        (%)    (%)        (%)    (%)

0-4          3233   0.96       3.94   1.21       3.62   1.04
5-9          2.80   0.76       3.28   0.88       3.62   1.02
10-14        2.33   0.64       3.54   0.84       2.88   0.87
15-19        5.20   2.01       5.68   2.01       6.18   2.24
20-24       13.32   4.77      13.50   4.62      14.54     2g
25-29        8.31   4.07       8.42   3.70       9.41   4.65
30-34        5.02   PBA)       5.42   2.45       5.72   2.06
35-39        4.88   1.33       5.73   1.62       5.38   1.34
40-44        4.52   1.33       5.84    Pot       4.67   1.52
45-49        3.60   1.22       4.47   be 7       4.78   1.49
50-54        5.66   1.33       5.86   1.48       6.68   1.54
55-59        6.11   1.72       6.19   1.78       7.82   1.97
60-64        8.86   2.44      10.35   2.95       8.91   3.30
65-69       10.60   2.66      12.53   3.52      11.44   2.30
70+          8.49   1.95      10.19   2.35       9.33   1.94

Average      6.20   1.95       7.00   2.15       7.00   2.08


As seen in Table 2, on average higher percent errors are associated with areas of small
population size. The higher percent errors in smaller areas may be associated with the in- 
stability of the smaller (resource based) economies, and associated instabilities in net migra- 
tion distributions. 

By census division, similar error patterns are observed. As seen in Table 3, the average 
absolute percent error across all regions and age groups is 4.83%, 5.19% for males and 5.60% 
for females. The IM is 1.27% for the total, 1.41% for males and 1.35% for females. Again, 
the error is bimodal, with peaks at 20-29 and 60-69. In addition, the females have higher
errors than males in the 20-29 age groups, while the reverse is true in the 60-69 age groups. 


Table 2

School District Error by Population Size

                      Total              Male              Female
Population        AAPE     IM       AAPE     IM       AAPE     IM
Grouping           (%)    (%)        (%)    (%)        (%)    (%)

0-9,999           8.87   3.16      10.14   3.89      10.27   3.65
10,000-24,999     6.07   2.47       6.92   2.96       6.62   2.58
25,000+           3.66   1.67       3.92   1.78       4.09   1.78

School District
Average           6.20   1.95       7.00   2.15       7.00   2.08

Table 3

Error by Age Group Across Census Division
1981 Estimated Versus Census

                 Total              Male              Female
Age          AAPE     IM       AAPE     IM       AAPE     IM
Group         (%)    (%)        (%)    (%)        (%)    (%)

0-4          2.31   0.54       3.20   0.76       2.28   0.58
5-9          1.52   0.50       1.71   0.55       2.13   0.68
10-14        1.69   0.39     Pag i)   0.57       2.50   0.60
15-19        3.81   1.39       3.79   1.30       4.68   1.63
20-24        9.83   3.07       9.30   2.91      10.90   3.41
25-29        7.02   3.04       7.30   2.87       8.09   3.37
30-34        3.28   1.29       3.51   1.43       3.85   1.25
35-39        3.34   0.66       3.06   0.57       4.21   0.88
40-44        3.86   0.88       4.29   1.01       4.16   0.90
45-49        2.91   0.70       3.20   0.75       3.75   0.83
50-54        4.82   0.64       4.41   0.75       6.10   0.86
55-59        5.49   1.34       5.36   1.55       6.94   1.30
60-64        7.88   1.95       8.37   2.29       7.94   1.74
65-69        8.48   1.89      10.30   2.67       9.79   1.43
70+          6.16   0.81       7.46   1.20       6.73   0.71

Avg          4.83   1.27       5.19   1.41       5.60   1.35


Table 4 (Census Division Error By Population Size) shows the improvement in error levels
resulting from aggregating to larger sub-provincial areas. Table 7 illustrates the negative rela- 
tionship between error levels and population size on a Census Division level. 

A comparison of Tables 5 and 6 again demonstrates the improvement in error levels when 
aggregating to larger age/sex cell sizes. Although this does indicate that some precautions 
should be observed when utilizing age/sex estimates for some small areas, we do not believe 
it should preclude use of the estimates for these areas. 


Table 4

Census Division Error by Population Size

                      Total              Male              Female
Population        AAPE     IM       AAPE     IM       AAPE     IM      No. of
Grouping           (%)    (%)        (%)    (%)        (%)    (%)       CDs

0-39,000           122   1.94       Teas   Poe)       8.79   2.29       10
40,000-59,999     4.32   1.82       5.03   2.14       4.91   1.83       10
60,000+            25)   0.87       2.93   0.98       2.84   0.90        9

Census Division
Average           4.83   1.27       5.19   1.41       5.60   1.35       29

Table 5

- School District -
Number of Estimates by Error Range

                   Average Absolute Percent Error Range
                  <5     5 to 10    10 to 15     15+     Total

No. of Cells     674       239        101         96      1110
Percent          61%       22%         9%         9%      100%

Table 6

- Census Division -
Number of Estimates by Error Range

                   Average Absolute Percent Error Range
                  <5     5 to 10    10 to 15     15+     Total

No. of Cells     306        77         25         27       435
Percent          70%       18%         6%         6%      100%


6. FINAL REMARKS


The procedure outlined above has particular advantages for use in a region with well 
developed sources of historical small area population and vital statistics data. It is felt that 
a procedure utilizing net-migration estimates is relatively straightforward, produces accep- 
table error levels, and can produce age/sex estimates soon after the reference date. Although 
the optimal situation would be to have in- and out-migration estimates, currently little infor- 
mation is available on small area migration flows within British Columbia. One further im- 
provement to the system being considered is the incorporation of Old Age Security counts 
to increase the stability and accuracy of estimates in the older age groups.


Table 7

Error by Census Division Across Age Groups
1981 Estimated Versus Census

                                                   Total            Male            Female
                                    Total       AAPE     IM     AAPE     IM     AAPE     IM
Census Division                  Population      (%)    (%)      (%)    (%)      (%)    (%)

 1000 East Kootenay               S85 7/25)     4.24   2.04     5.24   2.29     3.88   ANS)
 3000 Central Kootenay               52,045     4.00   2.18     4.03   2.13     5.06   1.69
 5000 Kootenay-Boundary              33,239     2.32   1.23     2.34   1.18     3.21   1.68
 7000 Okanagan-Similkameen           57,185     5.04   2.64     6.02   3.08     4.72   2.49
 9000 Fraser-Cheam                   56,930     S212   1.60     3.33   1.78     4.15   2.08
11000 Central Fraser Valley         115,015     3.14   1.43     3.46   1.52     3.65   1.81
13000 Dewdney-Alouette               62,000     2.10   1.15     2.56   1.25     2.20   1.32
15000 Greater Vancouver           1,168,700     1.63   0.94     1.68   0.93     1.67   0.98
17000 Capital                       249,475     1.64   0.87     2.34   1A |     1.18   0.61
19000 Cowichan Valley                45,315     3.09   1.66     3.36   1.69     3.85   2.08
21000 Nanaimo                        84,815     3.07   1.58     3.40   1.74     Be22   1.66
23000 Alberni-Clayoquot              32,560       2D   1.36     2.88   1.27     3.27   1.68
25000 Comox-Strathcona               68,620     1.44   0.80     1.85   0.87     2.85   1.50
27000 Powell River                   19,050     5.36   2.58     5.06   2.44     6.18   3.03
29000 Sunshine Coast                 16,625     4.84   2.57     6.79   3.58     5.65   2.81
31000 Squamish-Lillooet              18,925     1.82   0.99     2.56   1.37     3.10   1.58
33000 Thompson-Nicola               102,430     ANS!   1.10     2.07   0.10     2.65  e377.
35000 Central Okanagan               85,235     3.96   1.93     3.91   1.88     4.32   2.14
37000 North Okanagan                 69,033     5.26    Zee     6.44   3.06     5.05    ZOU
39000 Columbia-Shuswap               45,425     3.04   1.63     3.56   1.84     2.99   1.66
41000 Cariboo                        58,810     3.15   1.93     3.90   2.18     3.42   2.06
43000 Mount Waddington               14,675     8.96   3.04     5.13   1.59     L777   5.49
45000 Central Coast                   3,050    17.99   7.62    21.62   8.86    14.92   7.34
47000 Skeena-Queen Charlotte         24,030     4.82   2.09    SO ee   2.58     4.61   1.84
49000 Kitimat-Stikine                41,790     6.26   1.99     4.99   1.66     8.59   2.78
51000 Bulkley-Nechako                38,310     6.23   Zeal     5.76   2.10     6.83  Tey |
53000 Fraser-Fort George             89,430     3.50   1.41     3.39   1.25     3.72   1.68
55000 Peace River-Liard              55,340     8.00   2.95     9.43   3.05     7.34   2.83
57000 Stikine                         2,685    17.15   6.89    17.89   6.88    22.39   8.35

Average Error                                   4.83 oo eee zes 5.00 2.51


and Census Metropolitan Areas¹


ABSTRACT


A methodology has been developed for producing population estimates by single years of age and sex 
for small areas (census divisions and census metropolitan areas). To assure reliability, the estimates 
by single years of age are grouped into five years and only these grouped data are recommended for
dissemination. They are based on the age-sex composition of population from the last census, births 
by sex, deaths by single years of age and sex, estimates of migration by age and sex, and counts of 
family allowance recipients in the age group 1-14 years. 


KEY WORDS: Cohort-component method; Mean absolute error; Index of dissimilarity; Separation 
factor. 


1. INTRODUCTION 


The objective of this paper is to describe the methodology for estimating population by 
age and sex for small areas (census divisions and census metropolitan areas), present fin- 
dings of the evaluation of estimation methods, and finally to discuss the factors affecting 
the quality of estimates. According to the 1981 Census, the 266 Census divisions ranged in 
population from 2,000 to 2,000,000, and the 24 census metropolitan areas, from 100,000 
to 3,000,000. The description of the estimation methods and principal data sources are 
presented in section 2. The results of the evaluation of migration and population estimates 
are given in section 3. 


2. METHODOLOGY 
The descriptions of the estimation methods, as well as the preparation of the basic input 
data are presented below. 
2.1 Cohort-Component Method 


For each census division (CD) and census metropolitan area (CMA), the cohort-component 
method is used to produce population estimates by age. The equations are as follows: 


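For age 0,          $P_0^{t+1} = B - f_0 D_0 + \tfrac{1}{2} M_0$    (1)
For age 1,          $P_1^{t+1} = P_0^{t} - \left[ (1 - f_0) D_0 + \tfrac{1}{2} D_1 \right] + \tfrac{1}{2} (M_0 + M_1)$    (2)
For ages 2 to 84,   $P_a^{t+1} = P_{a-1}^{t} - \tfrac{1}{2} (D_{a-1} + D_a) + \tfrac{1}{2} (M_{a-1} + M_a)$    (3)
For ages 85+,       $P_{85+}^{t+1} = P_{84+}^{t} - \tfrac{1}{2} D_{84} - D_{85+} + \tfrac{1}{2} M_{84} + M_{85+}$    (4)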


¹ Revised version of the paper presented at the Federal-Provincial Committee on Demography meetings held on
November 28-29, 1985 at Statistics Canada, Ottawa, Canada. This research was undertaken with support from
the Small Area Data Program of Statistics Canada.

² Ravi B.P. Verma, K.G. Basavarajappa and Rosemary K. Bender, Demography Division, Statistics Canada, 4th
floor, Jean Talon Building, Tunney’s Pasture, Ottawa, Ontario, Canada K1A 0T6.

where $f_0$ = Separation factor of deaths at age 0
$M_a$ = Net migrants aged a between time t and t+1
$B$ = Births between time t and t+1
$D_a$ = Deaths at age a between time t and t+1
$P_a^{t}$ = Population aged a at time t.


The cohort-component method is also used at the provincial level by Statistics Canada 
(Statistics Canada, Catalogue No. 91-210), and by the province of British Columbia for pro- 
ducing population estimates by age at the census division, school and health district levels 
(Central Statistics Bureau 1980). 
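
Equations (1) to (4) can be written out as a single projection step. The sketch below is only illustrative: it assumes the population, deaths and net migrants are lists indexed by single year of age, with the last element holding the open-ended group (85+ in the paper), and the function and argument names are invented for the example.

```python
def cohort_component_step(pop, births, deaths, net_mig, f0):
    """Advance a population by one year using equations (1) to (4).

    pop, deaths, net_mig: lists indexed by age; the last index is the
    open-ended age group. f0 is the separation factor of deaths at age 0.
    """
    last = len(pop) - 1
    new = [0.0] * len(pop)

    # Equation (1): age 0 comes from the births of the period.
    new[0] = births - f0 * deaths[0] + 0.5 * net_mig[0]

    # Equation (2): age 1 carries the rest of the age 0 deaths.
    new[1] = (pop[0]
              - ((1 - f0) * deaths[0] + 0.5 * deaths[1])
              + 0.5 * (net_mig[0] + net_mig[1]))

    # Equation (3): closed single-year ages.
    for a in range(2, last):
        new[a] = (pop[a - 1]
                  - 0.5 * (deaths[a - 1] + deaths[a])
                  + 0.5 * (net_mig[a - 1] + net_mig[a]))

    # Equation (4): the open-ended group absorbs its own survivors
    # plus those ageing in from the preceding single year of age.
    new[last] = (pop[last - 1] + pop[last]
                 - 0.5 * deaths[last - 1] - deaths[last]
                 + 0.5 * net_mig[last - 1] + net_mig[last])
    return new
```

A useful check on the sketch is that the components balance: the new total equals the old total plus births, minus all deaths, plus all net migrants.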


2.2 Preparation of Basic Input Data 


Since we are proposing to produce preliminary postcensal population estimates within eight 
months after the reference date, final data on components of population change cannot be 
used because they do not become available until after 18 to 24 months. Consequently, estimates 
would have to be used for each component. 


Births and Deaths 


Preliminary estimates of births by sex for year (t) are obtained by multiplying the propor-
tional distribution by small areas of provincial total births by sex for year (t − 1) with the
provincial preliminary total births for year (t). Similarly, preliminary estimates of deaths
by age and sex for year (t) are obtained by multiplying the proportional distribution by small
areas of provincial total deaths by age and sex for year (t − 1) with the provincial preliminary
total deaths for year (t). Finally, they are converted into cohort deaths on the assumption
that dates of birth of those who die and the number of deaths are uniformly distributed over
a 12 month period except for deaths of age 0. The formulae are as follows:


For age 0, 
Cohort deaths (0) = deaths (0) x 0.89 
For age 1, 
Cohort deaths (1) = [deaths (0) x 0.11] + [deaths (1) x 0.5]
For ages 2 to 84,
Cohort deaths (age) = [deaths (age-1) x 0.5] + [deaths (age) x 0.5]
For ages 85+,
Cohort deaths (85+) = [deaths (84) x 0.5] + deaths (85+).


In the above formulae, the separation factors (f) are 0.89 for age 0, 0.11 for age 1 and 
0.5 for all other ages. 
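
The two steps in this subsection, the preliminary allocation and the conversion to cohort deaths, can be sketched as follows. The function names and data layouts are assumptions made for the example; the separation factor of 0.89 is the value quoted above.

```python
def preliminary_by_area(prev_year_by_area, prelim_provincial_total):
    """Distribute a preliminary provincial total (births or deaths) to
    small areas using each area's share of the previous year's total."""
    prev_total = sum(prev_year_by_area.values())
    return {area: prelim_provincial_total * n / prev_total
            for area, n in prev_year_by_area.items()}


def cohort_deaths(deaths, f0=0.89):
    """Convert deaths by age into cohort deaths.

    deaths: list indexed by age, last element being the open-ended group.
    """
    last = len(deaths) - 1
    out = [0.0] * len(deaths)
    out[0] = deaths[0] * f0
    out[1] = deaths[0] * (1 - f0) + deaths[1] * 0.5
    for age in range(2, last):
        out[age] = deaths[age - 1] * 0.5 + deaths[age] * 0.5
    out[last] = deaths[last - 1] * 0.5 + deaths[last]
    return out
```

The conversion only reassigns deaths between adjacent cohorts, so the total number of deaths is unchanged.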


Residual Net Migration 


First, the estimates of total population for the postcensal years for CDs and CMAs prepared 
by the regression-nested procedure are split by sex using the sex composition from the latest 
census. The regression-nested procedure is described elsewhere (Statistics Canada, Catalogue 
No. 91-211). For males and females, residual total net migration is computed by taking the 
difference between the population change and the natural increase. For each area, this is 
distributed by five year age groups using migration data by age from three sources: residual
net migration from the 1976 and 1981 censuses, migration data from income tax files and
the 1981 mobility question. The mobility question referred to is ‘‘Where were you on June 
1, 1976?”’ in the 1981 Census. From the responses obtained for this question, in-migrants 
to and out-migrants from each small area can be tabulated. The five year age groups are 
split into single years of age using Sprague multipliers. Before applying Sprague multipliers,
the residual net migration is first split into in and out migration. Using in and out tax migra- 
tion data as a reference, this calculation is done individually for each five-year age group. 


Residual In-Migration = (Tax Data In-Migration / Tax Data Net Migration) x Residual Net Migration


Residual Out-Migration = Residual In-Migration — Residual Net Migration 


Using the preceding ratios, major problems occur when the split net migration is not of 
the same sign as the reference tax data on net migration. In this case, the sign of the split 
net migration is kept, but the resulting in and out migration are exchanged to yield the ap- 
propriate sign. This is based on the assumption of equal magnitude of a reversal of the migra- 
tion flow. 
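
The split and the sign correction can be illustrated as below. This is one possible reading of the rule, not a confirmed implementation: "exchanged" is interpreted as swapping the magnitudes of the in- and out-flows so that both are non-negative and their difference keeps the sign of the residual net migration. The function name, the `tax_out` argument and the handling of a zero tax net migration are assumptions.

```python
def split_net_migration(residual_net, tax_in, tax_out):
    """Split residual net migration for one age group into in- and
    out-migration, using tax data as the reference.

    Returns (residual_in, residual_out) with
    residual_in - residual_out == residual_net.
    """
    tax_net = tax_in - tax_out
    if tax_net == 0:
        # The ratio is undefined; the paper does not say how this is handled.
        raise ValueError("tax data net migration is zero")

    residual_in = tax_in / tax_net * residual_net
    residual_out = residual_in - residual_net

    # When the residual and tax net migration differ in sign, both flows
    # come out negative. Swap their magnitudes (assumed interpretation).
    if residual_net * tax_net < 0:
        residual_in, residual_out = -residual_out, -residual_in
    return residual_in, residual_out
```

For instance, with tax in- and out-migration of 100 and 60 and a residual net migration of -20, the ratio first gives flows of -50 and -30; after the exchange the in-migration is 30 and the out-migration 50, which preserves the net figure of -20.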


2.3 Counts From The Family Allowances File, Ages 1-14 years 


Estimates of population produced by the cohort-component method for the age groups 
1-4, 5-9, 10-14 are replaced by counts of family allowance recipients at these ages which are 
readily available for CDs and CMAs, within 3 to 4 months after the reference date. Family 
allowances are paid universally in Canada and hence the counts are considered to be com- 
plete for all practical purposes. The data on the family allowance recipients are not provided 
by sex. Hence they are split into males and females using the sex composition from the latest 
census. 


2.4 Adjustments for Consistency with Provincial and Census Division Estimates 


Postcensal regression-nested estimates of total population of each CD and CMA become
available within six months after the reference date. In addition, provincial estimates of 
population also become available by age and sex at about the same time. Estimates of popula-
tion by age and sex prepared as described above for the CDs within each province are con-
trolled with respect to the census division total population estimates, and to the provincial
population estimates by age and sex on a pro rata basis. For the census metropolitan areas,
the age and sex totals are adjusted only to the CMA total population estimate.


3. EVALUATION 


The evaluation is done with respect to three criteria: (i) accuracy; (ii) timeliness and (iii)
consistency. Each of these is discussed below.


3.1 Accuracy 


The accuracy of population estimates by age and sex depends to a large extent on the ac-
curacy of estimation of the age-sex distribution of migrants, as the data on deaths by age
and sex are considered satisfactory. Thus an evaluation of population estimates by age and
sex indirectly throws light on the accuracy of migration estimates by age and sex. The ac-
curacy is examined by comparing the estimates with the corresponding census counts.


Table 1

Distribution of Census Divisions/CMAs Showing the Accuracy
of Population Estimates by Age, 1981

                         Methods          Levels of Mean Absolute Error (%) by Sex
                         of
                         Migration            Males                      Females
Provinces                Estimation  Under 3  3-5  5-10  10+     Under 3  3-5  5-10  10+

Newfoundland                 R           8     2     0    0         10     0     0    0
                             M           2     7     1    0          3     3     3    1
                             T           0     0     5    5          1     1     3    5
Prince Edward Island         R           1     1     1    0          2     0     0    1
                             M           1     2     0    0          1     2     0    0
                             T           0     0     1    2          0     0     2    1
Nova Scotia                  R           8     4     3    3          8     5     3    2
                             M           3     6     4    5          5     6     2    5
                             T           0     2     8    8          1     3    11    3
New Brunswick                R          10     1     2    2          7     4     3    1
                             M           4     7     3    1          3     8     3    1
                             T           0     2     5    8          0     4     5    6
Quebec                       R          13    27    23   13         24    18    19   15
                             M          12    26    26   12         17    23    21   15
                             T           1     5    37   33          3    13    32   28
Ontario                      R          30     8     8    7         37     5     4    7
                             M           8    16    18   11         21    10    10   12
                             T           0     8    34   11          4    10    31    8
Manitoba                     R           4     6     8    5          4     6     8    5
                             M           1     5    12    5          0     6     7   10
                             T           0     1     8   14          0     1     4   18
Saskatchewan                 R          10     5     1    2          9     5     2    2
                             M           1    11     5    1          4     5     5    4
                             T           1     1    10    6          1     1    12    4
Alberta                      R           9     3     1    2          8     3     3    1
                             M           5     5     3    2          5     4     5    1
                             T           0     2     5    8          0     3     5    7
British Columbia             R          19     3     3    4         23     1     1    4
                             M           9    13     2    5         14     8     3    4
                             T           0     0    13   16          0     3    16   10
CMA                          R          14     8     2    0         19     3     2    0
                             M           2    17     5    0         10     9     4    1
                             T           1     7    13    3          1    10    12    1

Note: R: Residual based age distribution of migrants, 1976-81.
      M: Mobility based age distribution of migrants, 1981.
      T: Annual tax migration data.
Source: Demography Division, Statistics Canada, 1985.


For each CD and CMA, three sets of population estimates by age and sex as of June 1, 
1981 produced by using the age distribution of migrants from the three sources (residual 
(1976-81), mobility (1976-1981) and annual tax files) and counts from family allowance files 
as described in sections 2.1 to 2.4 were compared with the 1981 census counts. The differences 
were termed errors and for each small area, a summary index known as the “mean absolute
error” (MAE) was computed by taking the arithmetic mean of percentage errors disregarding
the sign for 16 five year age groups. The smaller the value of this index, the more accurate
are the estimates. In Table 1, a classification of CDs by provinces and of CMAs is presented 
for four levels of mean absolute error: under 3%, 3-5%, 5-10% and over 10%. Overall, 
it appears that the residual based age distribution of migrants gives better estimates. For 
males, about 66% of the total number of census divisions had an MAE under 5%. For females 
this percentage was slightly higher, at 69%. In contrast, lower percentages were observed 
for the mobility (55% and 57%) and tax migration data (9% and 19%), for males and females, 
respectively. 

For CMAs too, the residual age distribution of migrants seems to give better estimates. 
The proportions of cases with MAE under 3% were 58% and 79% for males and females 
respectively. Mobility and tax based age distributions of migrants ranked second and third 
respectively, for both males and females. 

With the exception of Prince Edward Island, the relative accuracy of the three sets of 
age distribution observed for Canada largely holds good for each province. This is true for 
both males and females. However, in some cases the residual based age distributions seem 
to give results similar to the mobility based distributions. Such similarity was observed for 
males in three provinces (Newfoundland, New Brunswick and British Columbia), whereas 
for females it was found only in New Brunswick. 

It should be noted that the age distribution of migrants derived by the residual method 
uses the census age distributions of 1976 and 1981. Consequently, the population estimates 
as of June 3, 1981 prepared by using the migrant age distribution based on the residual method 
can be expected to be similar to the 1981 census age distribution. Hence, on the basis of 
this comparison we cannot conclude that the migrant age distribution derived by the residual 
method is better than the distribution derived from mobility question or from tax files. 

Table 2 presents the percentage distribution of CD and CMA outliers. The outliers are 
those CDs with an MAE of over 10% and those CMAs over 5%. They are presented by 
sex and the three sources of migrant age distributions. As expected, both for males and 
females, the proportion of outliers is generally low for estimates using residual based age
distribution. On the other hand, the percentage of outliers tends to be high for estimates
using tax based migration distribution.


Temporal Stability of the Three Sets of Estimates During Postcensal Years, 1982-1984 


For postcensal years, as there are no standard age distributions with which the estimates 
can be compared, the three population estimates by age and sex are compared with each 
other to learn of the temporal stability among them. A summary index known as the “index
of dissimilarity”, calculated as half of the sum of absolute differences in two percentage age
distributions, is used for this purpose. The range of the index is from 0 to 100. The smaller
the value, the greater is the similarity between the two distributions compared. The small
areas are classified into three levels of dissimilarity: (i) the smallest level of difference with
indices between 0% and 5%; (ii) the medium level of difference with indices between 5% and
10% and (iii) the outliers showing the index value of 10% and over. The classification of
CDs is presented in Table 3 and that of CMAs in Table 4.
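
The index and the three-level classification used for census divisions can be sketched as follows. The function names are hypothetical, and the treatment of values falling exactly on a boundary (5% or 10%) is an assumption, since the text does not specify it.

```python
def index_of_dissimilarity(pop_a, pop_b):
    """Half the sum of absolute differences between two percentage age
    distributions; ranges from 0 (identical) to 100 (no overlap)."""
    total_a, total_b = sum(pop_a), sum(pop_b)
    pct_a = [100.0 * x / total_a for x in pop_a]
    pct_b = [100.0 * x / total_b for x in pop_b]
    return 0.5 * sum(abs(a - b) for a, b in zip(pct_a, pct_b))


def classify_cd(index):
    """Assign a census division to one of the three levels of dissimilarity.
    Boundary values are placed in the higher class (assumed convention)."""
    if index < 5:
        return "0-5"
    if index < 10:
        return "5-10"
    return "10+"
```

Two populations of 50/50 and 60/40 across two age groups, for example, give an index of 10, which falls in the outlier class.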

From Table 3, it appears that all the three population distributions tend to be similar and
on average, a high percentage of cases, about 90%, are in the smallest category of differences
(0%-5%) with only about 7% falling in the 5% to 10% category.

The percentages of cases with the extreme level of differences (index of dissimilarity ex-
ceeding 10%) were also examined for the ten provinces and their total. For males, the percen-
tages of extreme cases were small, 3 to 5% between the residual and mobility based age
distributions. For females, relatively higher proportions of outliers were noticed. For other
comparisons, residual vs tax based, and mobility vs tax based, slightly higher proportions
of outliers were found. The results were similar for census metropolitan areas (see Table 4).


Table 2

Percentage of Outliers (a) Among Census Divisions by Province, and of CMAs, 1981

                               Males                  Females
                           R     M     T          R     M     T

Newfoundland               0     0    50          0    10    50
Prince Edward Island       0     0    67         33     0    33
Nova Scotia               17    28    44         11    28    17
New Brunswick             13     7    53          7     7    40
Quebec                    17    16    43         20    20    37
Ontario                   13    21    21         13    23    15
Manitoba                  22    22    61         22    43    78
Saskatchewan              11     6    33         11    22    22
Alberta                   13    13    53          7     7    47
British Columbia          14    17    55         14    14    34
Total                     15    16    43         15    20    35
CMA                        8    21    67          8    21    54

Note: R: Residual based age distribution of migrants.
      M: Mobility based age distribution of migrants.
      T: Tax based age distribution of migrants.
(a) The outliers are those CDs with MAE of over 10% and those CMAs with MAE of over 5%.
Source: Table 1.


Table 3

Distribution of Census Divisions by Level of Index of Dissimilarity
Obtained by Comparing the Age Distributions of Population Based on Residual,
Mobility and Tax Migration Sources, 1982 to 1984

Year/                      Males                              Females
Index            Residual  Residual  Mobility      Residual  Residual  Mobility
of                  vs        vs        vs            vs        vs        vs
Dissimilarity    Mobility    Tax       Tax         Mobility    Tax       Tax

YEAR 1982
0-5                 245       237       242           240       241       234
5-10                  7        13        10             8         5        11
10+                   8        10         8            12        14        15
Total               260       260       260           260       260       260

YEAR 1983
0-5                 235       221       223           230       229       223
5-10                 11        18        21            10        13        13
10+                  14        21        16            20        18        24
Total               260       260       260           260       260       260

YEAR 1984
0-5                 240       226       229           235       233       231
5-10                 11        16        14            15        13        12
10+                   9        18        17            10        14        17
Total               260       260       260           260       260       260

Source: Demography Division, Statistics Canada, October 1985.


Table 4

Distribution of Census Metropolitan Areas by Level of Index of Dissimilarity
Obtained by Comparing the Age Distributions of Population Based on Residual,
Mobility and Tax Migration Sources, 1982 to 1984

Year/                      Males                              Females
Index            Residual  Residual  Mobility      Residual  Residual  Mobility
of                  vs        vs        vs            vs        vs        vs
Dissimilarity    Mobility    Tax       Tax         Mobility    Tax       Tax

YEAR 1982
0-3                  24        24        24            23        22        23
3-5                   0         0         0             0         1         0
5+                    0         0         0             1         1         1
Total                24        24        24            24        24        24

YEAR 1983
0-3                  22        23        22            21        20        20
3-5                   2         0         0             1         2         3
5+                    0         1         2             2         2         1
Total                24        24        24            24        24        24

YEAR 1984
0-3                  21        21        21            21        20        20
3-5                   2         2         2             0         1         0
5+                    1         1         1             3         3         4
Total                24        24        24            24        24        24

Source: Demography Division, Statistics Canada, October 1985.


In conclusion, it may be said that although the three age distributions of migrants (residual, 
mobility and tax based) differed from each other, age distributions of population resulting 
from these were largely similar. 


3.2 Timeliness 


Timeliness refers to the availability of estimates within as short a time as possible after the 
reference date. Using the preliminary population totals (regression-nested estimates) which 
become available within six months from the reference date, the estimated numbers of births, 
deaths by age and net migrants by age as described in Sections 2.1 to 2.4, the population estimates 
by age and sex for CDs and CMAs could be prepared within eight months of the reference date. 


3.3 Consistency 


Consistency refers to the consistency in the sources of data sets used for estimation at various 
levels of administrative or other disaggregated areas and to the uniformity in the methods 
of estimation. While in certain cases, a different method may have to be used, it is highly 
desirable to use the same method throughout in order to ensure the methodological consistency 
of various levels of geographic disaggregation. 

For provinces, CDs and CMAs, the sources of data are the same for births and deaths: the
vital registration records. For migration data too, the sources are the same for all levels of
geographic disaggregation, namely tax files and mobility data from the census. However, an
additional data set, the residual age data derived from two consecutive censuses, is also used.

There is full methodological consistency between provinces and other levels as the cohort- 
component method is used in all cases. 
4. CONCLUSION 


By using the cohort-component method, three sets of estimates by age and sex have been 
prepared for CDs and CMAs. Each set uses a different migration component by age and 
sex: (i) tax file based; (ii) mobility data from the latest census and (iii) the residual derived 
from the two consecutive censuses. 

Although the three age distributions of migrants differ from each other, the resulting 
estimates of population by age and sex were largely similar. Each set involves its own assump- 
tions. Using a residual age distribution of migrants for postcensal estimation assumes that 
the age distribution remains constant for the period of estimation. A similar assumption is 
involved in using mobility data by age and sex for postcensal years. The data from tax files 
assume that the age-sex distribution remains the same for any two consecutive years. However, 
the type of movement measured by each of these sets is not the same. The residual measures 
only the net movement between the two consecutive censuses (e.g. 1976-81). The mobility 
question also measures five-year movements ranging from 0-4 years. The tax files, on the 
other hand show the movement during roughly a 12 month period. On the basis of the com- 
parisons made in the paper, it cannot be concluded that one migrant data set giving rise to 
population estimates is better than the other. A more satisfactory evaluation of the three 
sets of estimates can be made only when the next census results become available. 
Experience with Small Area Population Estimates¹


ROSEMARY K. BENDER²


ABSTRACT 


Statistics Canada’s current methodologies for estimating the population of census divisions and census 
metropolitan areas are the regression-nested and component methods. This paper presents the experience 
with these estimates for the period 1981 to 1985, focusing on problems encountered with the input 
data on family allowance recipients. 


KEY WORDS: Regression-nested estimates; Component estimates; Family allowance recipients; Postal 
code files. 


1. INTRODUCTION 


Statistics Canada’s current methodologies for estimating the population of census divi- 
sions (CDs) and census metropolitan areas (CMAs) are the regression-nested and compo- 
nent methods. The regression estimates for 1982, 1983 and 1985 were published in Catalogue 
No. 91-211 on schedule. Those for 1984 were only made available in March of 1985. There 
was a delay in obtaining the input data on family allowance. Furthermore, as explained be- 
low, we encountered problems with the quality of these data. In particular, the resulting popu- 
lation estimates for CMAs were not acceptable and an alternate methodology had to be used. 

Component estimates of the population for CDs and CMAs have been published in Cata- 
logue No. 91-212 on schedule for 1982 and 1983. We should release the 1984 estimates by 
April 1986. An evaluation of the component estimates produced thus far has shown the data 
to be of good quality. 


2. ADJUSTMENTS 


Since introducing the regression estimates for CDs and CMAs in 1982, some adjustments
to the data and the methodology have been necessary. They are summarized below:


- For the 1983 estimates for the CD Chicoutimi and the CMA Chicoutimi-Jonquière
in the province of Quebec, the family allowance data were adjusted based on the growth
pattern of the previous year. The problem was traced to the postal codes used to obtain
the family allowance data.


- In 1984, the estimates for 17 census divisions were imputed with preliminary component
estimates.


- In 1984, we decided to publish for the CMA of Calgary, estimates based on the annual 
census conducted by the city. This will be done for the entire 1981-1986 period. 


- In 1984, we developed a new methodology for all CMAs other than Calgary, which 
aggregates census division regression estimates. This will be used for the entire 1981-1986 
period. 


The following sections explain the problems encountered in more detail. 


¹ Abridged version of the paper presented at the meetings of the Federal-Provincial Committee on Demography
held on November 28-29, 1985, Ottawa, Canada.

² Rosemary K. Bender, Demography Division, Census and Demographic Statistics Branch, Statistics Canada, 4th
floor, Jean Talon Building, Tunney’s Pasture, Ottawa, Ontario, Canada K1A 0T6.
3. PROBLEMS WITH INPUT DATA FOR REGRESSION ESTIMATES 


There was a delay in producing 1984 estimates due to problems encountered in obtaining 
data on family allowance recipients from Health and Welfare Canada, and the appropriate 
postal code translation files necessary to process these data. 


i) Family Allowance Data 


The number of Family Allowance recipients as of June 1 is generally available by
mid-September of each year. The 1984 data from Health and Welfare Canada, however, were
delayed as a result of decentralization of the regional operations of the program in Ontario.
Problems were also encountered in the files of all provinces with respect to information on
effective dates of transfer and reason codes for inter-area transfers. The 1984 data were
released to Statistics Canada in an unedited form in November. Corrective actions were taken by
Health and Welfare Canada, and Family Allowance data as of June 1, 1985 were delivered on schedule.


ii) Postal Code Files 


The data on family allowance recipients from Health and Welfare Canada are coded by
postal code. Therefore, to identify the children receiving family allowance in each CD and
CMA, a file must be created that groups the postal codes by CD and CMA. This is done
using a master file that contains all the postal codes in Canada, with detailed geographic
codes that are used to assign the postal codes to any level of geographic disaggregation.

Problems have arisen that were unexpected and in some cases had serious consequences. 
For our estimates, it is important that the postal code files used each year by Health and 
Welfare Canada be consistent with the one that was used to develop the regression model. 
The only change in the file should be the addition of new postal codes. Any shifting of postal
codes from one region to another can result in apparent changes to the population that do not
actually occur.

The problems we encountered stem from the fact that since we developed our regression 
model, different divisions and departments have produced the postal code files. In 1982 and 
1983, it was done by the Administrative Data Development Division of Statistics Canada. 
In 1984, the Standards Division of Statistics Canada took over the responsibility and in 1985 
it was done by Health and Welfare Canada. Each had its own approach resulting in family 
allowance data that was not consistent from year to year. Two different types of problems 
arose. We have resolved the first. However, the second will persist throughout the 1981-1986 
postcensal period. 

The first source of difficulty was the shifting of postal codes from one area to another. 
The master file is created by the Standards Division of Statistics Canada. However, in some 
cases, the CD or CMA geographic code is blank or wrong. For CDs this occurs mostly with 
rural codes, where postal codes often refer to post offices covering large territories across 
CD boundaries. The inclusion of the CMA geographic codes is fairly recent, and the quality 
improves each year. Thus, our initial assumption that the postal code file would be consis- 
tent from year to year was not quite true. There are changes made each year. 

Our files were initially created by the Administrative Data Development Division (ADDD) 
of Statistics Canada. They made changes in their copy of the master file before proceeding 
to group the data. In 1984 the Standards Division took over producing our file. When we 
became aware of the consequences this would have, we developed with ADDD a way to match 
the original master file with the latest master file from Standards Division, adding only the 
new postal codes. Any changes to the CD or CMA codes were ignored. We realise that by 
doing this we do not have the most accurate postal code file available. However, for our 
purposes, we are interested in the changes to the proportions of children receiving family 
allowance. The effect of using some erroneous, but consistent postal codes is that we include 
or exclude some children from another area in the calculation of proportions. The propor- 
tions would not be significantly different from those using correct postal codes, but would 
change if these children were suddenly excluded or included. 

This process of adding only new codes to our postal code file improved significantly the 
quality of the 1984 family allowance data for census divisions. Only 17 of the 231 regression 
estimates of CDs (excluding those of British Columbia, as they produce their own regression 
estimates) needed to be imputed. Because of the delay in obtaining the data, we were able 
to use preliminary estimates from the component method. For census metropolitan areas, 
there were still inconsistencies, which we believe are due to a different type of problem. 

When the postal codes are grouped by CDs and CMAs, they are also converted into ranges
of postal codes. For example, if the postal codes A1A1A1, A1A1A2, A1A1A3 and A1A1A4
all have the same CMA code, then they will be combined into the range A1A1A1-A1A1A4.
However, in processing the more than 600,000 postal codes, certain assumptions are made,
depending on the software. If, in the above example, A1A1A2 was not there, the program may
still create the same range, assuming that if A1A1A2 did exist, it would have the same CMA
code as the others in the range. This type of assumption could alter the family allowance
data processed for each region. Furthermore, if different software is used each year, serious
inconsistencies can arise.
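
The effect of such an assumption can be illustrated with a minimal sketch. The functions
below are hypothetical and are not the software used by any of the divisions mentioned; they
only contrast a strict rule (a range ends at any missing code) with a gap-tolerant rule
(missing codes are bridged), for codes assumed to differ only in their last digit.

```python
# Hypothetical illustration: two ways of collapsing postal codes that share
# a CMA code into ranges. Codes are assumed to differ only in the last digit.

def collapse(codes, bridge_gaps):
    """Return a list of (first, last) ranges for the given postal codes.

    bridge_gaps=False: a range ends as soon as a code is missing.
    bridge_gaps=True:  missing codes are assumed to belong to the same area,
                       so a single range spans the first to the last code.
    """
    codes = sorted(codes)
    ranges = []
    start = prev = codes[0]
    for code in codes[1:]:
        same_prefix = code[:-1] == prev[:-1]
        consecutive = same_prefix and int(code[-1]) == int(prev[-1]) + 1
        if consecutive or (bridge_gaps and same_prefix):
            prev = code
        else:
            ranges.append((start, prev))
            start = prev = code
    ranges.append((start, prev))
    return ranges


def in_ranges(code, ranges):
    """True if the code falls inside any (first, last) range."""
    return any(first <= code <= last for first, last in ranges)


codes = ["A1A1A1", "A1A1A3", "A1A1A4"]   # A1A1A2 is absent from the file
strict = collapse(codes, bridge_gaps=False)
bridged = collapse(codes, bridge_gaps=True)
print(strict)    # two ranges, A1A1A2 excluded
print(bridged)   # one range, A1A1A2 silently included
```

Under the gap-tolerant rule, a recipient later coded A1A1A2 would be assigned to this CMA;
under the strict rule the same recipient would not, which is the kind of year-to-year
inconsistency described above.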

We believe this is the major cause of the poor quality of the family allowance data for
CMAs. The software used by ADDD and by the Standards Division differed. What
complicated matters even more was that, as of 1985, the entire operation is done by
Health and Welfare Canada, again using different software. We therefore had to disregard
the data and develop an alternative methodology for CMAs.


4. METHODOLOGICAL CHANGE FOR CMAs 


The CMA estimates previously released for 1982 and 1983 were based on the same 
regression-nested procedures as for census divisions. In the evaluation of the 1984 estimates, 
however, estimates for many census metropolitan areas were found to be inconsistent with 
alternate sources and past growth trends. As described above, the problems seem more related 
to the quality of the input files rather than to methodology. 

Taking into account these inconsistencies as well as comments from the provincial focal 
points, it was decided to use an alternate methodology. This new methodology was previously
developed for estimating various CMA components of population change. It consists of
aggregating census division regression estimates, using the ratio of the population of the CMA
to that of overlapping CDs, as observed the previous year by the component method. In
comparing estimates for 1981, obtained through this methodology, with the 1981 Census
counts for census metropolitan areas, an average absolute error of 1.3% was observed, as
compared to 2.3% for the previous methodology.
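
One possible reading of this ratio aggregation is sketched below. The function name and
all figures are invented for illustration, and the exact formula used in production is not
given in the text: the sketch assumes that the CMA estimate is the sum of the current
regression estimates of the overlapping CDs, multiplied by the previous year's ratio of the
CMA component estimate to the sum of the component estimates of those CDs.

```python
# Hypothetical sketch of the ratio-aggregation idea; all figures are invented.

def cma_estimate(cd_regression, cd_component_prev, cma_component_prev):
    """Estimate a CMA population from the regression estimates of the
    census divisions (CDs) that overlap it.

    cd_regression:       current-year regression estimates, one per CD
    cd_component_prev:   previous-year component estimates for the same CDs
    cma_component_prev:  previous-year component estimate for the CMA
    """
    ratio = cma_component_prev / sum(cd_component_prev)
    return ratio * sum(cd_regression)


# Two overlapping CDs; the CMA held 80% of their combined population last year.
estimate = cma_estimate(
    cd_regression=[310_000, 205_000],
    cd_component_prev=[300_000, 200_000],
    cma_component_prev=400_000,
)
print(round(estimate))
```

Because the CMA figure is derived from CD-level estimates, it no longer depends on family
allowance data grouped directly by CMA postal code ranges.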

To maintain consistency in methodology for the entire 1981-1986 period, the alternate
method has been used to derive the CMA estimates for 1982 to 1985, and will be used for
1986. That is, estimates of population for CMAs other than Calgary are obtained by
aggregating the census division regression-nested estimates, and those for Calgary, as described
below, are based on the annual census conducted by the city.

In 1984, it was found that the regression-nested estimates for Calgary CMA for 1982 and
1983 were too high in comparison with the census counts conducted annually by the city
of Calgary. The component estimates also supported the idea of adjusting the regression-nested
estimates for Calgary. It was decided to publish estimates based on the city of Calgary
census count, extrapolating the April data to June 1. This is in line with Statistics Canada
policy where, when there is a complete enumeration, it should take precedence over an estimate
prepared by an indirect procedure, unless there is evidence that the enumerated count is
suspect.


5. COMPARISON WITH OTHER DATA SOURCES 


The regression and component estimates are compared with alternative data sources 
whenever possible. We receive from the Saskatchewan and Alberta governments the number 
of people registered in their respective health care programs. These data are used in the regres- 
sion model. However, they are also evaluated for consistency with the family allowance data 
and past growth trends. In most cases they were consistent, and differences were traced to 
the problems encountered with family allowance data. 

The Quebec Bureau of Statistics produces annual population estimates of their ad- 
ministrative regions which are subdivisions of the Quebec CDs. Their data are comparable 
to ours except for the CD of Nouveau Québec. This census division, located in northern 
Quebec, is largely comprised of unorganized territories, and it is difficult to estimate the 
population. The BSQ generally adopts our estimates, though for 1984 it imputed its own 
estimate for Nouveau Québec. 

We also appreciate feedback from users who may have access to specific local area data. 


6. CONCLUSION 


The methods used to produce population estimates for census divisions and census 
metropolitan areas have in general functioned very well. However, in the case of the regression
estimates, problems with input data made it necessary to impute estimates for certain
CDs with alternate data, and to revise the methodology for CMAs. 

The problems encountered were mostly related to the family allowance data and the postal 
code files that are necessary to process these data. Most of the problems have been resolved. 
However, as Health and Welfare Canada is now taking over the responsibility of creating the postal
code files, the 1986 data may still have problems of consistency and will have to be carefully 
evaluated. 

Despite these problems, the regression methodology with certain adaptations will be used 
to produce estimates for 1986. If, however, we decide to continue with the methodology for 
the 1986-1991 period, we must first ensure that consistent postal code files be processed by 
the same department throughout the period. 
PREFACE 


This issue is devoted to papers presented at the Methodology Symposium on Missing Data 
in Surveys held at Statistics Canada in Ottawa, April 16-17, 1986. The symposium was joint- 
ly sponsored by Statistics Canada’s Methodology Research Committee and the Laboratory 
for Research in Statistics and Probability at Carleton University. Concern about missing data 
in surveys (due to non-response or unusable responses) has been increasing in recent years. 
The symposium provided a forum for more than 200 professionals from universities, govern- 
ment organizations and the private sector in Canada and the United States to exchange in- 
formation concerning recent theoretical and applied developments. 

The symposium was opened by the Chief Statistician of Canada, Dr. Ivan Fellegi. He 
spoke about the international community’s concern about the growing gap between theoretical 
and applied statistics and commended the organizers for bringing together specialists from 
both fields. While stating that the primary purpose of the conference was to make headway 
in the chosen topic, Dr. Fellegi also noted that the underlying theme was the extent to which 
statistical agencies should be involved in model-building. 

The symposium included four sessions. The first session, “General Issues and Organizational
Experiences”, was chaired by L. Kish of the University of Michigan and included presentations
by G. Kalton (University of Michigan), G.B. Gray (Statistics Canada), D.W. Chapman
(U.S. Bureau of the Census) and L.R. Curtin (U.S. National Center for Health Statistics).
The chairman of the afternoon session of April 16, “Design and Estimation”, was M. Hansen
of Westat Inc. Papers were presented by P.S.R.S. Rao (University of Rochester), S. Michaud
(Statistics Canada), C.E. Särndal (University of Montreal), G. Lazarus (Statistics Canada)
and V.P. Godambe (University of Waterloo).

The morning session of April 17, “Item Non-Response and Imputation”, was chaired by
M. Moore of the University of Montreal. This session included contributions by D. Rubin
(Harvard University), P. Giles (Statistics Canada), M.S. Srivastava (University of Toronto)
and M.A. Hidiroglou (Statistics Canada). The chairman of the final session, “Case Studies”,
was J.N.K. Rao of Carleton University. Papers were presented by S. Hinkins (U.S. Internal 
Revenue Service), V. Tremblay (University of Montreal) and S. Cheung (Statistics Canada). 
The symposium was closed with a general discussion of developments concerning missing 
data in surveys led by J.N.K. Rao (chairman) and a panel including G. Kalton, L. Kish, 
D. Rubin, and I. Sande (Statistics Canada). 

Nine of the symposium papers are included in this issue of the Journal. Additional sym- 
posium papers accepted for publication will appear in the next issue. 
The Treatment of Missing Survey Data


GRAHAM KALTON and DANIEL KASPRZYK¹


ABSTRACT 


Missing survey data occur because of total nonresponse and item nonresponse. The standard way to 
attempt to compensate for total nonresponse is by some form of weighting adjustment, whereas item 
nonresponses are handled by some form of imputation. This paper reviews methods of weighting ad- 
justment and imputation and discusses their properties. 


KEY WORDS: Nonresponse; Item nonresponse; Weighting adjustments; Imputation. 
1. INTRODUCTION 


Surveys typically collect responses to a large number of items for each sampled element. 
The problem of missing data occurs when some or all of the responses are not collected for 
a sampled element or when some responses are deleted because they fail to satisfy edit con- 
straints. It is common practice to distinguish between total (or unit) nonresponse, when none 
of the survey responses are available for a sampled element, and item nonresponse, when 
some but not all of the responses are available. Total nonresponse arises because of refusals, 
inability to participate, not-at-homes, and untraced elements. Item nonresponse arises because 
of item refusals, “don’t knows”, omissions and answers deleted in editing. 

This paper reviews the general-purpose methods available for handling missing survey data. 
The distinction between total and item nonresponse is useful here since different adjustment 
methods are used for these two cases. In general the only information available about total 
nonrespondents is that on the sampling frame from which the sample was selected (e.g., the 
strata and PSUs in which they are located). The important aspects of this information can 
usually be readily incorporated into weighting adjustments that attempt to compensate for 
the missing data. Hence as a rule weighting adjustments are used for total nonresponse. 
Methods for making weighting adjustments are reviewed in Section 2. 

In the case of item nonresponse, however, a great deal of additional information is available 
for the elements involved: not only the information from the sampling frame, but also their 
responses for other survey items. In order to retain all survey responses for elements with 
some item nonresponses, the usual adjustment procedure produces analysis records that in- 
corporate the actual responses to items for which the answers were acceptable and imputed 
responses for other items. Imputation methods for assigning answers for missing responses 
are reviewed in Section 3. 

In general the choice between weighting adjustments and imputation for handling miss- 
ing survey data is fairly clearcut; there are cases, however, when the choice is not so clear. 
These are cases of what may be termed partial nonresponse, when some data are collected 
for a sampled element but a substantial amount of data is missing. Partial nonresponse can 
arise, for instance, when a respondent terminates an interview prematurely, when data are 
not obtained for one or more members of an otherwise cooperating household (for household 
level analysis), or when a sampled individual provides data for some but not all waves of 
a panel survey. Discussions of the choice between weighting and imputation to compensate 
for wave nonresponse in a panel survey are given by Cox and Cohen (1985) and Kalton (1986). 


¹ Graham Kalton, Survey Research Center, University of Michigan, Ann Arbor, Michigan, 48106-1248 and Daniel 
Kasprzyk, Population Division, U.S. Bureau of the Census, Washington, D.C., 20233. The authors would like 
to thank the referees for their helpful comments. 
Although weighting adjustments and imputation are treated as separate approaches in
the discussion below, they are in fact closely related. The relationship and differences
between the two approaches are briefly discussed in Section 4, which also mentions some
alternative ways of handling missing survey data.


2. WEIGHTING ADJUSTMENTS 


Weighting adjustments are primarily used to compensate for total nonresponse. The essence 
of all weighting adjustment procedures is to increase the weights of specified respondents 
so that they represent the nonrespondents. The procedures require auxiliary information on 
either the nonrespondents or the total population. The following four types of weighting 
adjustments are briefly reviewed below: population weighting adjustments, sample weighting 
adjustments, raking ratio adjustments, and weights based on response probabilities. More 
details are provided in Kalton (1983). 


2.1 Population Weighting Adjustments 


The auxiliary information used in making population weighting adjustments is the distribu- 
tion of the population over one or more variables, such as the population distribution by 
age, sex and race available from standard population estimates. The sample of respondents 
is divided into a set of classes, termed here weighting classes, defined by the available aux- 
iliary information (e.g., White males aged 15-24, non-White females aged 25-34, etc.). The 
weights of all respondents within a weighting class are then adjusted by the same multiplying 
factor, with different factors in different classes. The adjustment is carried out in such a 
way that the weighted respondent distribution across the weighting classes conforms to the 
population distribution. 
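
As a minimal sketch of this adjustment, the hypothetical function below computes one
multiplying factor per weighting class so that the weighted respondent distribution matches
a known population distribution. The class labels, weights and population shares are
invented for illustration.

```python
# Minimal sketch with invented figures: class-level population weighting.

def population_weighting_factors(base_weights, classes, population_shares):
    """Return one multiplying factor per weighting class so that the weighted
    respondent distribution matches the population distribution.

    base_weights:       design weight of each respondent
    classes:            weighting class label of each respondent
    population_shares:  known population proportion for each class (sums to 1)
    """
    total = sum(base_weights)
    class_totals = {}
    for w, c in zip(base_weights, classes):
        class_totals[c] = class_totals.get(c, 0.0) + w
    # factor = population share / weighted respondent share
    return {c: population_shares[c] / (class_totals[c] / total)
            for c in class_totals}


weights = [1.0, 1.0, 1.0, 1.0, 1.0]
classes = ["young", "young", "young", "old", "old"]
shares = {"young": 0.5, "old": 0.5}

factors = population_weighting_factors(weights, classes, shares)
adjusted = [w * factors[c] for w, c in zip(weights, classes)]
young_share = sum(a for a, c in zip(adjusted, classes) if c == "young") / sum(adjusted)
print(factors, young_share)
```

In this example the over-represented class is weighted down and the under-represented
class is weighted up, so that each accounts for half of the weighted total.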

This type of adjustment is often termed poststratification. That term is avoided here, 
however, because although population weighting resembles poststratification, there is an im- 
portant difference between the two. Like population weighting, poststratification weights 
the sample to make the sample distribution conform to the population distribution across 
a set of classes (or strata). However, the standard textbook theory of poststratification is 
concerned only with the sampling fluctuations that cause the sample distribution to deviate 
from the population distribution, not with the more major deviations that can arise from 
varying response rates across the classes. Poststratification adjustments are more like a fine 
tuning of the sample, resulting generally in only small variations in the weights across strata. 
In consequence, provided that the strata are not small, poststratification leads to lower stan- 
dard errors for the survey estimates. In contrast, population weighting adjustments may in- 
volve more major adjustments and result in higher standard errors. 

Population weighting adjustments attempt to reduce the bias created by nonresponse and 
coverage errors. Consider the estimation of a population mean $\bar{Y}$ from a sample in which
the elements are selected with equal probability. Suppose that the population is divided into
a set of weighting classes, with a proportion $W_h$ of elements in class $h$. Assume that
respondents always respond and that nonrespondents never do. Let $R_h$ and $M_h$ be the
proportions of respondents and nonrespondents respectively in class $h$, and let
$\bar{R} = \sum W_h R_h$ be the overall response rate. Then, following Thomsen (1973), the bias
of the unadjusted respondent mean ($\bar{y}$) can be expressed as


$$B(\bar{y}) = \bar{R}^{-1} \sum W_h (\bar{Y}_{rh} - \bar{Y}_r)(R_h - \bar{R}) + \sum W_h M_h (\bar{Y}_{rh} - \bar{Y}_{mh}) = A + B \qquad (1)$$

where $\bar{Y}_{rh}$ and $\bar{Y}_{mh}$ are the means for respondents and nonrespondents in class $h$
respectively, and $\bar{Y}_r$ is the population mean for the respondents. The use of the population
weighting adjustment leads to the weighted sample mean, $\bar{y}_p = \sum W_h \bar{y}_{rh}$, where
$\bar{y}_{rh}$ is the respondent sample mean in class $h$. The bias of $\bar{y}_p$ is simply the second
term in $B(\bar{y})$, that is, $B(\bar{y}_p) = B$.

If $A$ and $B$ are of the same sign, the population weighting adjustment reduces the
absolute bias in the estimate of $\bar{Y}$ by $|A|$. If $\bar{Y}_{rh} = \bar{Y}_{mh}$, as occurs in expectation
when the nonrespondents are missing at random within the weighting classes, then $B = 0$. In
this case, the population weighting adjustment eliminates the bias. The term $A$ is a
covariance-type term between the class response rates and the class respondent means. It is
zero if either the response rates or the respondent means do not vary between classes. In either
of these cases, the population weighting adjustment has no effect on the bias of the estimator. It
may be noted that population weighting adjustments may increase the absolute bias of the
estimate of $\bar{Y}$. This will occur when $A$ and $B$ are of opposite signs and $|A| < 2|B|$.
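
The decomposition in equation (1) can be checked numerically. The sketch below uses
invented figures for two weighting classes (population proportions, response rates, and
respondent and nonrespondent means) and verifies that the bias of the unadjusted respondent
mean equals the sum of the two terms, and that the bias remaining after population
weighting equals the second term alone.

```python
# Numerical check of the bias decomposition with invented figures
# for two weighting classes.

W = [0.6, 0.4]        # population proportions W_h
R = [0.9, 0.5]        # response rates R_h
Yr = [10.0, 20.0]     # respondent means in each class
Ym = [12.0, 26.0]     # nonrespondent means in each class
M = [1 - r for r in R]  # nonresponse rates M_h

R_bar = sum(w * r for w, r in zip(W, R))
# Population mean over respondents and nonrespondents
Y_bar = sum(w * (r * yr + m * ym) for w, r, m, yr, ym in zip(W, R, M, Yr, Ym))
# Population mean for the respondents
Y_r = sum(w * r * yr for w, r, yr in zip(W, R, Yr)) / R_bar

A = sum(w * (yr - Y_r) * (r - R_bar) for w, yr, r in zip(W, Yr, R)) / R_bar
B = sum(w * m * (yr - ym) for w, m, yr, ym in zip(W, M, Yr, Ym))

bias_unadjusted = Y_r - Y_bar
bias_adjusted = sum(w * yr for w, yr in zip(W, Yr)) - Y_bar

print(A, B, bias_unadjusted, bias_adjusted)
```

Here both terms are negative, so the adjustment reduces the absolute bias by $|A|$ but
leaves the within-class component $B$ untouched.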

Population weighting adjustments require external data on the population distributions 
for the variables to be used. Care is needed to ensure that the data on which the population 
distributions are based are exactly comparable with the survey data; otherwise, inappropriate 
weights will result. Since the procedure weights up to population distributions, it does more 
than just attempt to compensate for nonresponse. It also compensates for coverage errors 
and makes a poststratification adjustment. 


2.2 Sample Weighting Adjustments 


As with population weighting adjustments, with sample weighting adjustments the sam- 
ple is divided into weighting classes; varying weights are then assigned to these classes in 
an attempt to reduce the nonresponse bias. The essential difference between the two pro- 
cedures lies in the auxiliary information used. As described above, population weighting ad- 
justments are based on externally obtained population distributions. No data are needed for 
the sample nonrespondents. In contrast, sample weighting adjustments employ only data 
internal to the sample and require information about the nonrespondents. 

With sample weighting adjustments, the nonresponse adjustment weights for the weighting 
classes are made proportional to the inverses of the response rates in the classes. In order 
to compute these response rates, the numbers of respondents and nonrespondents in the classes 
must be determined. It is therefore necessary to know to which class each respondent and 
nonrespondent belongs. Since typically very little information about the nonrespondents is 
available, the choice of weighting class is usually severely restricted. It is often limited to 
general sample design variables (e.g., PSUs and strata), characteristics of those variables 
(e.g., urban/rural, geographical region), and sometimes some additional variables available 
on the sampling frame. On occasion it may also be possible to collect information on one 
or two variables for the nonrespondents, for instance by interviewer observation. 

As population weighting adjustments resemble poststratification, so sample weighting ad- 
justments resemble two-phase sampling. The first phase sample is the total sample of 
respondents and nonrespondents; the second phase sample is the subsample of respondents, 
selected with different sampling fractions (response rates) in different strata (weighting classes). 
The sample weighted mean can be represented by $\bar{y}_s = \sum w_h \bar{y}_{rh}$, where $w_h$ is the
proportion of the total sample in weighting class $h$. Assuming no coverage errors,
$E(w_h) = W_h$, the population proportion in class $h$, as used in the population weighted
estimator $\bar{y}_p = \sum W_h \bar{y}_{rh}$. The bias of $\bar{y}_s$ is the same as that of $\bar{y}_p$, namely
$B(\bar{y}_s) = B$ as given in equation (1); hence the effect of the sample weighting adjustment on
the bias of the survey estimate
is the same as that of the population weighting adjustment. Since sample weighting ad- 
justments use only data for the sample, they do not compensate for coverage errors (unlike 
population weighting adjustments). 
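
A minimal sketch of a sample weighting adjustment follows, using invented counts for two
weighting classes. Each class receives a factor equal to the inverse of its response rate,
and the sample weighted mean is formed from the class respondent means weighted by the
share of the total sample (respondents plus nonrespondents) in each class.

```python
# Minimal sketch with invented figures: sample weighting adjustment.

def sample_weighted_mean(n_sampled, n_responded, respondent_means):
    """Weight each class by the inverse of its response rate and return
    (adjustment factors, sample weighted mean).

    n_sampled:         respondents plus nonrespondents per class
    n_responded:       respondents per class
    respondent_means:  respondent sample mean per class
    """
    factors = [n / r for n, r in zip(n_sampled, n_responded)]
    n_total = sum(n_sampled)
    w = [n / n_total for n in n_sampled]   # sample proportions w_h
    mean = sum(wh * y for wh, y in zip(w, respondent_means))
    return factors, mean


factors, y_s = sample_weighted_mean(
    n_sampled=[600, 400],
    n_responded=[540, 200],
    respondent_means=[10.0, 20.0],
)
# Unadjusted respondent mean, for comparison
unadjusted = (540 * 10.0 + 200 * 20.0) / (540 + 200)
print(factors, y_s, unadjusted)
```

The class with the lower response rate receives the larger factor, which pulls the
estimate toward that class's respondent mean.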

Population and sample weighting adjustments have different data requirements, and hence 
address different potential sources of bias. In practice the two forms of adjustment are used 
in combination. Generally sample weighting adjustments are applied first, and then popula- 
tion weighting adjustments are applied afterwards. A common approach is initially to deter- 
mine the sample weights needed to compensate for unequal selection probabilities, next to 
revise these weights to compensate for unequal response rates in different sample weighting 
classes (e.g., urban/rural classes within geographical regions), and finally to revise the weights 
again to make the weighted sample distribution for certain characteristics (e.g., age/sex) con- 
form to the known population distribution for those characteristics. The use of this approach 
in the U.S. Current Population Survey is described by Bailar et al. (1978). 

As with population weighting adjustments, the aim of sample weighting adjustments is 
to reduce the bias that nonresponse may cause in survey estimates. An effect of sample 
weighting adjustments is, however, to increase the variances of the survey estimates. There 
is therefore a trade-off to be made between bias reduction and variance increase. 

An indication of the amount of increase in variance from weighting can be obtained by 
considering the situation where the element variances within the weighting classes are all the 
same and the variances between the class means are negligible compared to the within-class 
variances. In this situation, the loss of precision from weighting is approximately the same 
as that arising from the use of disproportionate stratified sampling when proportionate 
stratified sampling is optimum; Kish (1965, Section 11.7C; 1976) discusses this latter case. 

Under the above conditions, weighting increases the variance of a sample mean by approximately
the factor $L = (\sum_h W_h k_h)(\sum_h W_h / k_h)$, where $W_h$ is the proportion of the population in
class $h$ and $k_h$ is the weight for class $h$. An alternative expression for $L$ is
$(\sum_h n_h)(\sum_h n_h k_h^2) / (\sum_h n_h k_h)^2$, where $n_h$ is the sample size in class $h$. The factor $L$
becomes large when the variance of the weights is large.
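
As a numerical check on the two expressions for the variance inflation factor, the following sketch evaluates both. The function names and the example figures are illustrative assumptions; the two forms agree when the class sample sizes are proportional to the class population proportion divided by the class weight.

```python
def kish_factor_from_proportions(W, k):
    """L = (sum_h W_h k_h)(sum_h W_h / k_h); the W_h are proportions summing to 1."""
    return sum(Wh * kh for Wh, kh in zip(W, k)) * sum(Wh / kh for Wh, kh in zip(W, k))


def kish_factor_from_counts(n, k):
    """L = (sum_h n_h)(sum_h n_h k_h^2) / (sum_h n_h k_h)^2."""
    total_n = sum(n)
    sum_nk = sum(nh * kh for nh, kh in zip(n, k))
    sum_nk2 = sum(nh * kh ** 2 for nh, kh in zip(n, k))
    return total_n * sum_nk2 / sum_nk ** 2


# Two equal-sized population classes, the second weighted twice as heavily.
L_pop = kish_factor_from_proportions([0.5, 0.5], [1.0, 2.0])
# Matching class sample sizes: n_h proportional to W_h / k_h.
L_smp = kish_factor_from_counts([100, 50], [1.0, 2.0])
```

In this example both expressions give 1.125, that is, a variance increase of about 12.5 per cent; with equal weights the factor is 1.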

A large variance in the weights can arise from segmenting the sample into many weighting 
classes with only a few sampled elements in each. When the weighting classes are small, their 
response rates are unstable, and this gives rise to a large variation in the weights. To avoid 
this effect, it is common practice to limit the extent to which the sample is segmented. Even 
so, there may still be some weighting classes that require large weights. Sometimes these 
weighting classes are handled by collapsing them with adjacent ones and sometimes their 
weights are cut back to some acceptable maximum value (see Bailar et al. 1978 and Chap- 
man et al. 1986, for examples). These procedures avoid the increase in variance associated 
with the use of extreme weights, but they may lead to increased bias; their effect on the bias 
is, however, unknown. 
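
The effect of cutting back extreme weights on the variance inflation factor can be illustrated as follows. The cap value and the function names are hypothetical, and the sketch simply truncates the weights without redistributing the removed weight, which is only one of several possible conventions.

```python
def trim_weights(weights, cap):
    """Cut back any weight above `cap` to the cap itself."""
    return [min(w, cap) for w in weights]


def weight_variation_factor(weights):
    """n * sum(w^2) / (sum w)^2, the element-level form of the factor L."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


weights = [1.0, 1.0, 1.0, 5.0]
before = weight_variation_factor(weights)                      # 1.75
after = weight_variation_factor(trim_weights(weights, 2.0))    # 1.12
```

The factor falls from 1.75 to 1.12, but the trimmed element now represents fewer population elements than its response rate implies, which is the source of the possible bias noted above.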

In some cases it seems desirable to use several auxiliary variables in forming the weighting 
classes for population or sample weighting adjustments. However, if the classes are formed 
by taking the full crossclassification of the variables, there will be a large number of weighting 
classes. Unless the sample is very large, the sample sizes in the resultant weighting classes 
will be small, and the instability in the response rates will lead to a large variance in the weights 
and loss of precision in the survey estimates. One way to deal with this problem is to cut 
down on the number of classes by collapsing cells, for instance by discarding some of the 
auxiliary variables or using coarser classifications. Another way is to base the weights on 
a model, as is done in raking ratio weighting discussed below. 
2.3 Raking Ratio Adjustments


When weighting classes are taken to be the cells in the crossclassification of the auxiliary 
variables, population weighting adjustments make the joint distribution of the auxiliary 
variables in the sample conform to that in the population. Similarly, sample weighting ad- 
justments make the joint distribution of the auxiliary variables in the respondent sample con- 
form to that in the total sample. As noted above, however, this crossclassification approach 
may have the undesirable effect of creating many small, and hence unstable, weighting classes. 
Also, it is not always possible to employ this approach with population weighting adjustments: 
in many cases the population marginal distributions, and perhaps some bivariate distribu- 
tions, of the auxiliary variables are available, but the full joint distribution is unknown. 

An alternative approach is to develop weights that make the marginal distributions of 
the auxiliary variables in the sample conform to marginal population distributions (with 
population weighting) or marginal total sample distributions (with sample weighting), without 
ensuring that the full joint distribution conforms. The method of raking ratio estimation, 
or raking, may be used to obtain weights that satisfy these conditions. Raking corresponds 
to iterative proportional fitting in contingency table analysis (see, for instance, Bishop et 
al., 1975). 
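
A minimal sketch of raking for two auxiliary variables is given below. It applies iterative proportional fitting to a table of weighted respondent counts until the row and column totals agree with the target margins. The function name, the tolerance and the example figures are assumptions for illustration; the sketch presumes strictly positive cells and target margins with a common grand total.

```python
def rake(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting of a two-way table to given margins."""
    fitted = [list(row) for row in table]
    for _ in range(max_iter):
        # Scale each row so that it sums to its target.
        for i, row in enumerate(fitted):
            factor = row_targets[i] / sum(row)
            fitted[i] = [v * factor for v in row]
        # Scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            factor = target / sum(row[j] for row in fitted)
            for row in fitted:
                row[j] *= factor
        # Columns are now exact; stop once the rows also agree.
        error = max(abs(sum(row) - t) for row, t in zip(fitted, row_targets))
        if error < tol:
            break
    return fitted


respondent_counts = [[20.0, 30.0], [40.0, 10.0]]
raked = rake(respondent_counts, row_targets=[50.0, 50.0], col_targets=[40.0, 60.0])
# Cell weighting factors applied to the respondents in each cell.
cell_weights = [[r / c for r, c in zip(rr, cr)] for rr, cr in zip(raked, respondent_counts)]
```

The fitted table reproduces both sets of margins but retains the cross-product ratio of the starting table, which is why the joint distribution need not conform to that of the population.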

Consider the use of raking in the simple case of two auxiliary variables. Let $W_{hk}$ be the
proportion of the population in the $(h, k)$-th cell of the crossclassification, and let $\hat{W}_{hk}$ be
the proportion assigned to that cell by the raking algorithm. Conditional on the total and
respondent sample sizes in the cells (and assuming all cells have at least one respondent),
the bias of the raking ratio adjusted sample mean $\bar{y}_q = \sum_h \sum_k \hat{W}_{hk} \bar{y}_{rhk}$ is


$$B(\bar{y}_q) = \sum_h \sum_k W_{hk} M_{hk} (\bar{Y}_{rhk} - \bar{Y}_{mhk}) + \sum_h \sum_k (\bar{W}_{hk} - W_{hk}) (\bar{Y}_{rhk} - \bar{Y}_{rh\cdot} - \bar{Y}_{r\cdot k} + \bar{Y}_{r\cdot\cdot})$$


where $\bar{W}_{hk} = E(\hat{W}_{hk})$. The first term in this bias corresponds to the bias term $B$ in
equation (1) for the population and sample weighting adjustments. It is zero in expectation if
the cell nonrespondents are random subsets of the cell populations. The second term is zero
if either $\bar{W}_{hk} = W_{hk}$ or there is no interaction in the $\bar{Y}_{rhk}$ for this classification.

Underlying the raking ratio weighting procedure is a logit model for the cell response rates.
With the model $\ln[R_{hk}/(1 - R_{hk})] = \alpha_h + \beta_k$ for the response rates in a two-way
classification, $\bar{W}_{hk} = W_{hk}$. Thus, under this model, the second term in $B(\bar{y}_q)$ is zero.

Further discussion of raking ratio weighting is given by Oh and Scheuren (1978a,1978b, 
1983). Oh and Scheuren (1978a) also provide a bibliography on raking. 


2.4 Weighting with Response Probabilities 


Although a number of methods for weighting with response probabilities have been pro- 
posed, this approach has not been widely adopted as an adjustment procedure. The basis 
of the approach is to assume that all population elements have probabilities (usually required 
to be non-zero) of responding to the survey. Some method is used to estimate the response 
probabilities for responding elements. These elements are then given nonresponse adjust- 
ment weights that are in inverse proportion to their estimated response probabilities. 

An early application of this approach is the well-known procedure of Politz and Sim- 
mons (1949, 1950). A single (evening) call is made to each selected household, and during 
the course of the interview respondents are asked on how many of the previous five evenings 
they were at home at about the same time. Their response probabilities are then taken to 
be the fraction of the six evenings (including the one of the interview) that they were at home, 
and the inverses of these probabilities are used in the analysis. Note that the procedure does 
not deal with those who were out on all six evenings and those who refused. 
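
The Politz and Simmons weighting can be written out directly. In the sketch below, the function names are illustrative; a respondent who reports being at home on k of the previous five evenings is given the response probability (k + 1)/6 and hence the weight 6/(k + 1).

```python
def politz_simmons_weight(evenings_home_of_previous_five):
    """Inverse of the estimated probability of being found at home."""
    k = evenings_home_of_previous_five
    if not 0 <= k <= 5:
        raise ValueError("expected a count between 0 and 5")
    # Estimated response probability is (k + 1) / 6; the weight is its inverse.
    return 6.0 / (k + 1)


def weighted_mean(values, evenings_home):
    """Weighted respondent mean using the Politz-Simmons weights."""
    weights = [politz_simmons_weight(k) for k in evenings_home]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)
```

A respondent who was at home on none of the previous five evenings receives a weight of 6, whereas one who was at home on all five receives a weight of 1.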
Another approach for estimating response probabilities is to regress response status (1
for respondents, 0 for nonrespondents) on a set of variables available for both respondents 
and nonrespondents, using a logistic or probit regression. The predicted values from the regres- 
sion for the respondents are then taken to be their response probabilities, and weights in 
inverse proportion to these predicted values are used in the analysis. A special case is when 
the predictor variables are dummy variables that identify a set of classes. The predicted 
response probabilities are then the class response rates, and the method reduces to a sample 
weighting adjustment. The method is most appropriate for situations where a good deal of 
information is available for the nonrespondents, as for instance when the nonrespondents 
are losses after the first wave of a panel survey. Little and David (1983) discuss the applica- 
tion of the method for panel nonresponse. It should be noted that if the regression is highly 
predictive of response status, the resultant weights will vary markedly, leading to a substan- 
tial loss in the precision of the survey estimates. 
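
The regression approach can be sketched with a logistic regression on a single predictor, fitted here by Newton-Raphson in plain Python. This is an assumption-laden illustration: the function names are hypothetical, only one predictor is handled, and the predictor must take at least two distinct values. The example uses a dummy variable for two classes, so that the fitted probabilities reproduce the class response rates.

```python
import math


def fit_logistic(x, r, iterations=25):
    """Fit logit P(r = 1) = a + b * x by Newton-Raphson; returns (a, b)."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri in zip(x, r):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            v = p * (1.0 - p)
            g0 += ri - p               # score for the intercept
            g1 += (ri - p) * xi        # score for the slope
            h00 += v                   # information matrix entries
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def response_weights(x, r):
    """Weights for the respondents (in file order): 1 / predicted probability."""
    a, b = fit_logistic(x, r)
    return [1.0 + math.exp(-(a + b * xi)) for xi, ri in zip(x, r) if ri == 1]


# Class 0: 8 of 10 respond; class 1: 4 of 10 respond.
x = [0] * 10 + [1] * 10
r = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6
weights = response_weights(x, r)
```

With these data the respondents in the first class receive a weight of 1.25 and those in the second a weight of 2.5, the inverses of the class response rates of 0.8 and 0.4.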

Drew and Fuller (1980, 1981) describe an approach for estimating response probabilities 
from the number of respondents secured at successive calls. In their model, the population 
is divided into classes. Within each class, every element is assumed to have the same response 
probability which remains the same at each call. The model also allows for a proportion 
of hard-core nonrespondents that is assumed constant across classes. Under these assump- 
tions, the response probabilities for each class and the proportion of hard-core nonrespondents 
can be estimated, and hence weighting adjustments can be made. Thomsen and Siring (1983) 
adopt a similar approach using a more complex model. 

Finally, mention should be made of a related approach that compensates for nonresponse 
by weighting up difficult-to-interview respondents. Bartholomew (1961), for instance, pro- 
posed making only two calls in a survey, and weighting up the respondents at the second 
call to represent the nonrespondents. The assumption behind this approach is that the 
nonrespondents are like the late respondents. This assumption seems questionable, however, 
and empirical evidence from an intensive follow-up study of nonrespondents in the U.S. Cur- 
rent Population Survey does not support it (Palmer and Jones 1966; Palmer 1967). 
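
For concreteness, the two-call estimator can be sketched as below for an equal probability sample. The function name and the example figures are illustrative; the estimator is valid only under the assumption, questioned above, that the nonrespondents resemble the second-call respondents.

```python
def two_call_estimate(n_sample, first_call_values, second_call_values):
    """Mean in which second-call respondents also represent the nonrespondents."""
    n1 = len(first_call_values)
    mean1 = sum(first_call_values) / n1
    mean2 = sum(second_call_values) / len(second_call_values)
    # The n_sample - n1 units not reached at the first call are all assigned
    # the mean of those who responded at the second call.
    return (n1 * mean1 + (n_sample - n1) * mean2) / n_sample


# 10 sampled units: 6 respond at the first call, 2 at the second, 2 never.
estimate = two_call_estimate(10, [10.0] * 6, [20.0] * 2)   # 14.0
```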


3. IMPUTATION 


A wide variety of imputation methods has been developed for assigning values for miss- 
ing item responses. The aim here is to provide a brief overview of the methods, the basic 
differences between them, and some of the issues involved in imputation. A fuller treatment 
is provided by Kalton and Kasprzyk (1982). 

Imputation methods can range from simple ad hoc procedures used to ensure complete 
records in data entry to sophisticated hot-deck and regression techniques. The following are 
some common imputation procedures: 


(a) Deductive imputation. Sometimes the missing answer to an item can be deduced with 
certainty from the pattern of responses to other items. Edit checks should check for con- 
sistency between responses to related items. When the edit checks constrain a missing 
response to only one possible value, deductive imputation can be employed. Deductive 
imputation is the ideal form of imputation. 

(b) Overall mean imputation. This method assigns the overall respondent mean to all miss- 
ing responses. 

(c) Class mean imputation. The total sample is divided into classes according to values of 
the auxiliary variables being used for the imputation (comparable to weighting classes). 
Within each imputation class the respondent class mean is assigned to all missing responses. 
(d) Random overall imputation. A respondent is chosen at random from the total respondent
sample, and the selected respondent’s value is assigned to the nonrespondent. This
method is the simplest form of hot-deck imputation, that is an imputation procedure
in which the value assigned for a missing response is taken from a respondent to the
current survey.


(e) Random imputation within classes. In this hot-deck method, a respondent is chosen at
random within an imputation class, and the selected respondent’s value is assigned to
the nonrespondent.


(f) Sequential hot-deck imputation. The term sequential hot-deck imputation is used here
to describe the procedure used with the labor force items in the U.S. Current Population 
Survey (Brooks and Bailar 1978). The procedure starts with a set of imputation classes. 
A single value for the item subject to imputation is assigned for each class (perhaps taken 
from a previous survey). The records in the survey’s data file are then considered in turn. 
If a record has a response for the item in question, its response replaces the value stored 
for the imputation class in which it falls. If the record has a missing response, it is assign- 
ed the value stored for its imputation class. 

The hot-deck method is similar to random imputation within classes. If the order of 
the records in the data file were random, the two methods would be equivalent, apart 
from the start-up process. The non-random order of the list generally acts to the benefit 
of the hot-deck method since it gives a closer match of donors and recipients provided 
that the file order creates positive autocorrelation. The benefit is, however, unlikely to 
be substantial. 

The sequential hot-deck suffers the disadvantage that it may easily make multiple uses 
of donors, a feature that leads to a loss of precision in survey estimates. Multiple use 
of a donor occurs when, within an imputation class, a record with a missing response 
is followed by one or more other records with missing responses. The number of imputa- 
tion classes that can be used with the method also has to be limited in order to ensure 
that donors are available within each class. 

Useful discussions of the sequential hot-deck method are provided by Bailar et al. 
(1978), Bailar and Bailar (1978, 1983), Ford (1983), Oh and Scheuren (1980), Oh et al. 
(1980), and Sande (1983). 


(g) Hierarchical hot-deck imputation. The above disadvantages of the sequential hot-deck
are avoided in the hierarchical hot-deck method, a form of hot-deck imputation developed
for the items in the March Income Supplement of the Current Population Survey. The 
procedure sorts respondents and nonrespondents into a large number of imputation classes 
from a detailed categorization of a sizeable set of auxiliary variables. Nonrespondents 
are then matched with respondents on a hierarchical basis, in the sense that if a match 
cannot be made in the initial imputation class, classes are collapsed and the match is made 
at a lower level of detail. Coder (1978) and Welniak and Coder (1980) provide further 
details on the hierarchical hot-deck procedure. 


(h) Regression imputation. This method uses respondent data to regress the variable for which
imputations are required on a set of auxiliary variables. The regression equation is then
used to predict the values for the missing responses. The imputed value may either be
the predicted value, or the predicted value plus some residual. There are several ways
in which the residual may be obtained, as discussed later.


(i) Distance function matching. This hot-deck method assigns a nonrespondent the value
of the ‘‘nearest’’ respondent, where ‘‘nearest’’ is defined in terms of a distance function
for the auxiliary variables. Various forms of distance function have been proposed (e.g.,
Sande 1979; Vacek and Ashikaga 1980), and the function can be constructed to reduce
the multiple use of donors by incorporating a penalty for each use (Colledge et al. 1978). 
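
The sequential hot-deck procedure described in (f) can be sketched as a single pass through the file. The function name and record layout are assumptions made for the illustration; a missing response is represented by None, and the starting values stand for the values taken from a previous survey.

```python
def sequential_hot_deck(records, start_values):
    """Single pass through the file in its stored order.

    records: list of (imputation_class, value) pairs, value None if missing.
    start_values: initial stored value for each imputation class.
    Returns a list of (imputation_class, value, was_imputed) triples.
    """
    stored = dict(start_values)
    completed = []
    for cls, value in records:
        if value is not None:
            stored[cls] = value                          # respondent replaces the stored value
            completed.append((cls, value, False))
        else:
            completed.append((cls, stored[cls], True))   # nonrespondent receives the stored value
    return completed


records = [("A", 5), ("A", None), ("B", None), ("A", None), ("B", 7), ("B", None)]
completed = sequential_hot_deck(records, {"A": 1, "B": 2})
```

In this example the respondent value 5 in class A is used twice because two missing records follow it before another class A respondent appears, and the first missing record in class B receives the starting value 2. Both features correspond to the properties of the method noted in (f).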
Although at first sight these may appear a diverse set of procedures, they can nearly all
be fitted within a single unifying framework. The methods can all be described, at least ap- 
proximately, as special cases of the general regression model 


$$\hat{y}_{mi} = b_{r0} + \sum_j b_{rj} z_{mij} + \hat{e}_{mi} \qquad (2)$$


where $\hat{y}_{mi}$ is the imputed value for the $i$th record with a missing $y$ value, $z_{mij}$ are values
reflecting the auxiliary variables for that record, $b_{r0}$ and $b_{rj}$ are the regression coefficients for the
regression of $y$ on $x$ for the respondents, and $\hat{e}_{mi}$ is a residual chosen according to a specified
scheme for the particular imputation method.

Equation (2) represents the regression imputation method in an obvious way. If the $\hat{e}_{mi}$’s
are set at zero, then the imputed value is the predicted value from the regression; otherwise
a residual of some form may be added. The equation also represents class mean imputation
by defining the $z_{mij}$’s to be dummy variables that represent the classes, and setting $\hat{e}_{mi} = 0$.
The regression equation then reduces to $\hat{y}_{mhi} = \bar{y}_{rh}$, the class mean. Random imputation
within classes is obtained by adding a residual to the class mean, where the residual is the
deviation from the class mean for one of the respondents. Then $\hat{y}_{mhi} = \bar{y}_{rh} + e_{rhk}$, where
$e_{rhk}$ is the deviation for respondent $k$ in class $h$; this reduces to $\hat{y}_{mhi} = y_{rhk}$, the value for that
respondent. The sequential and hierarchical hot-deck methods resemble the random within
class method. The overall mean and random overall imputation methods are degenerate cases
of the class mean and random within class methods that use no auxiliary information.

An important consideration in the choice of imputation method is the type of variable 
being imputed. All the above methods can be applied routinely with continuous variables, 
but some of them are not suitable for use with categorical or discrete variables (such as being 
a member of the labor force (1) or not (0), and the number of completed years of educa- 
tion). Overall mean, class mean, and regression imputations impute values like 0.7 for being 
a member of the labor force (i.e., a 70% chance) and 10.7 for the number of completed 
years of education. These values are not feasible for individual respondents, and rounding 
them to whole numbers leads to bias. For this reason, these imputation methods do not work 
well for categorical and discrete variables. A notable advantage of all hot-deck methods is 
that they always give feasible values since the values are taken from respondents. 

There are two major distinguishing features of the above imputation methods that deserve 
elaboration: whether or not a residual is added and, if one is, the form of the residual; and 
whether the auxiliary information is used in dummy variable form to represent classes or 
whether it is used straightforwardly in the regression. These features are discussed in the 
next two subsections. Other issues arising with the use of imputation are then discussed in 
subsequent subsections. 


3.1 Choice of Residuals 


Imputation methods may be classified as deterministic or stochastic according to whether
the $\hat{e}_{mi}$’s are set at zero or not. For each deterministic imputation method, there is a
stochastic counterpart. Let $\hat{y}_{mid}$ be the value imputed by the deterministic method and
$\hat{y}_{mis} = \hat{y}_{mid} + \hat{e}_{mi}$ be that imputed by the corresponding stochastic method. Then
$E_2(\hat{y}_{mis}) = \hat{y}_{mid}$, where $E_2$ denotes expectation over the sampling of residuals given the
initial sample, provided that $E_2(\hat{e}_{mi}) = 0$ (as generally applies).

The choice between a deterministic and the corresponding stochastic imputation method 
depends on the form of survey analysis to be conducted. Consider first the estimation of 
the population mean of the y-variable using the sample mean of the respondents’ values and 
the nonrespondents’ imputed values. As Kalton and Kasprzyk (1982) show, given that
$E_2(\hat{y}_{mis}) = \hat{y}_{mid}$, it follows that the expectation of the sample mean is the same whether the
deterministic method or the corresponding stochastic method is used. Thus both methods 
have the same effect on the bias of the estimate. However, the addition of random residuals 
in the stochastic method causes a loss of precision in the sample mean. Although this loss 
can be controlled by the choice of a suitable method of sampling residuals (Kalton and Kish 
1984), nevertheless some loss in precision occurs. For this reason a deterministic scheme is 
preferable for the purpose of estimating the population mean. 

Consider now the estimation of the element standard deviation and distribution of the 
y-variable. Deterministic imputation methods fare badly for these purposes, since they cause 
an attenuation in the standard deviation and they distort the shape of the distribution. This 
may be simply illustrated in terms of the class mean imputation method. By assigning the 
class mean to all the missing values in a class, the shape of the distribution is clearly distorted 
with a series of spikes at the class means. The standard deviation of the distribution is at- 
tenuated because the imputed values reflect only the between-class and not the within-class 
variance. The appeal of the stochastic imputation methods is that the residual term captures 
the within-class (or residual) variance, and hence avoids the attenuation of the element stan- 
dard deviation and the distortion of the distribution. 
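
The attenuation can be seen in a small numerical sketch. The data and function names are hypothetical. Class mean imputation is compared with random imputation within classes in which donors are drawn without replacement where possible; in the example each donor is used exactly once, so the result does not depend on the random draw.

```python
import random
from statistics import mean, pvariance


def class_mean_impute(respondents, n_missing):
    """Completed data when every missing value receives its class mean."""
    completed = []
    for cls, ys in respondents.items():
        completed += ys + [mean(ys)] * n_missing[cls]
    return completed


def random_within_class_impute(respondents, n_missing, rng):
    """Completed data when missing values receive respondents' values."""
    completed = []
    for cls, ys in respondents.items():
        k = n_missing[cls]
        if k <= len(ys):
            donors = rng.sample(ys, k)            # without replacement
        else:
            donors = [rng.choice(ys) for _ in range(k)]
        completed += ys + donors
    return completed


respondents = {"A": [1.0, 3.0], "B": [5.0, 9.0]}
n_missing = {"A": 2, "B": 2}

var_respondents = pvariance([y for ys in respondents.values() for y in ys])
var_class_mean = pvariance(class_mean_impute(respondents, n_missing))
var_stochastic = pvariance(
    random_within_class_impute(respondents, n_missing, random.Random(0))
)
```

Here the respondents' variance is 8.75; class mean imputation reduces the variance of the completed data to 7.5 and places four of the eight values at the two class means, whereas the stochastic method leaves the variance at 8.75.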

Since some survey analyses are likely to involve the distributions of the variables, stochastic 
imputation methods like the hot-deck methods are generally preferred. Once a decision is 
made to use a stochastic method, the question of how to choose the residuals arises. If the 
standard regression assumptions are accepted, the residuals could be chosen from a normal 
distribution with a mean of zero and a variance equal to the residual variance from the respon- 
dent regression. However, this places complete reliance on the model. An alternative that 
avoids the normality assumption is to choose the residuals randomly from the empirical 
distribution of the respondents’ residuals. Another alternative is to select a residual from 
a respondent who is a ‘‘close’’ match to the nonrespondent, measuring ‘‘close’’ in terms 
of similar values on the auxiliary variables. This attractive alternative avoids the assumption 
of homoscedasticity and guards against misspecification of the distribution of the residual 
term. In the limit, the closest respondent is one who has the same values of all the auxiliary 
variables as the nonrespondent. In this case, the nonrespondent is given one of the matched 
respondents’ values. This case arises with hot-deck methods, where nonrespondents and 
respondents are matched in terms of the auxiliary variables, and nonrespondents are assign- 
ed values from matched respondents. 

A further consideration in the choice of residuals is to make the imputed values feasible 
ones. As noted above, deterministic methods may impute values for categorical and discrete 
variables that are not feasible. Some stochastic methods solve this problem through the alloca- 
tion of the residuals. In particular, the use of respondents’ residuals with the random within 
class and the sequential and hierarchical hot-deck methods ensures that the imputed values 
are feasible ones. 


3.2 Imputation Class or Regression Imputation 


As noted earlier, both imputation class and regression imputation methods fall within the 
imputation model given by equation (2). The difference between them lies in the ways in 
which they employ the auxiliary variables. 

Imputation class methods divide the sample into a set of classes. For this purpose, con- 
tinuous auxiliary variables have to be categorized. There is complete flexibility in the way 
the classes are formed, and the symmetrical use of the auxiliary variables in different parts 
of the sample is not required. Thus, for instance, in imputing for hourly rate of pay in a
sample of employees, the sample might first be divided into two parts, union members and 
nonmembers; then the imputation classes for the members might be formed in terms of age 
and occupation whereas those for nonmembers might be formed in terms of sex and industry. 
As a rule, the aim is to construct classes of adequate size that explain as much of the variance
in the variable to be imputed as possible. When the classes are formed by a complete 
crossclassification of the auxiliary variables, the underlying model contains all main effects 
and all interactions for the crossclassification. The limitation of imputation class methods
is that the number of classes formed has to be constrained to ensure that there is some
minimum number of respondents in each class. The hierarchical hot-deck method attempts
to extend the amount of auxiliary data used, but even with this method matches of respondents 
and nonrespondents often cannot be made at the finer levels of detail. Coupled with the 
use of a random respondent residual within a class, imputation class methods have the valuable 
property that imputed values are feasible ones: that is, the imputed values are actual 
respondents’ values. 

Regression imputation methods have an advantage over imputation class methods in the 
number and in the level of detail of the auxiliary variables they can employ. Age can, for 
instance, be taken as a continuous variable rather than being categorized into a few classes. 
The regression model allows more main effects to be included in the model, but at the price 
of fewer interactions. Regression models can, of course, include some interactions, but they 
need to be specified. The models can also include polynomial terms and employ transforma- 
tions, but again they need to be specified. The regression model has the potential of pro- 
viding better predictions for the imputed values, but to achieve this careful modelling is 
required. Careful imputation modelling is unrealistic for all the variables in a survey, but 
it may be feasible for one or two major ones (and especially so for continuous surveys). 
Without careful modelling, there is a serious risk of poor imputations, although as noted 
earlier, this risk can be reduced by the allocation of random residuals from ‘‘close’’ 
respondents. 

If a regression imputation assigns the residual from a respondent with exactly the same 
values of the auxiliary variables, the imputed value is necessarily a feasible one. If, however, 
there is even a small difference between the respondent’s and nonrespondent’s values on the 
auxiliary variables, the imputed value may not be feasible. A variant of regression imputa- 
tion that avoids this problem, termed predictive mean matching, is described by Little (1986b) 
(Little attributes the method to Rubin). With predictive mean matching, the nonrespondent 
is matched to the respondent with the closest predicted value. Then, instead of adding the 
respondent’s residual to the nonrespondent’s predicted value, the nonrespondent is assigned 
the respondent’s value. The method is thus a hot-deck method, and is similar to distance 
function matching. 
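
Predictive mean matching can be sketched as follows for a single auxiliary variable. The function name and data are illustrative assumptions, and ties are resolved in favour of the first respondent in file order. With only one predictor and a non-zero slope, matching on the predicted value is equivalent to matching on the auxiliary variable itself; the distinction matters only when several predictors are combined in the regression.

```python
def predictive_mean_matching(resp_x, resp_y, miss_x):
    """Impute each missing y with the observed y of the respondent whose
    predicted value is closest to the nonrespondent's predicted value."""
    n = len(resp_x)
    x_bar = sum(resp_x) / n
    y_bar = sum(resp_y) / n
    sxx = sum((x - x_bar) ** 2 for x in resp_x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(resp_x, resp_y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    def predict(x):
        return intercept + slope * x

    resp_pred = [predict(x) for x in resp_x]
    imputed = []
    for xm in miss_x:
        target = predict(xm)
        donor = min(range(n), key=lambda i: abs(resp_pred[i] - target))
        imputed.append(resp_y[donor])     # the donor's actual value, not a prediction
    return imputed


# A discrete y-variable: every imputed value is one observed for a respondent.
imputed = predictive_mean_matching([1, 2, 3, 4], [0, 1, 1, 3], [2.2, 3.9])   # [1, 3]
```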

The choice between imputation class and regression imputation methods should in part 
depend on the efforts made to develop the regression model. Unless adequate resources are 
devoted to the development of a regression model, the imputation class methods may be 
safer. The choice should also in part depend on the sample size. With large samples, hot- 
deck methods are likely to be able to use enough classes to take advantage of all the major 
predictor variables; however, with small samples this may not hold, and regression methods 
may have greater potential. David et al. (1986) describe an interesting study that compares
regression models for imputing wages and salary in the U.S. Current Population Survey with 
hierarchical hot-deck imputations. Despite the extensive efforts made to develop the regres- 
sion models, the hot-deck imputations were not found to be inferior in this large sample. 


3.3 Effect of Imputation on Relationships 


Although most of the literature on imputation deals with its effect on univariate statistics 
such as means and distributions, a large part of survey analysis is concerned with bivariate 
and multivariate relationships. Here the analysis of relationships can be considered in broad
terms to include crosstabulation, correlation or regression analysis, comparisons of subclass 
means or proportions, and any other analysis involving two or more variables. As will be 
illustrated below, imputation can have harmful effects on all analyses of relationships, often 
attenuating the associations between variables. Discussions of the effects of imputations on 
relationships are provided by Santos (1981), Kalton and Kasprzyk (1982) and Little (1986a).

The general nature of the effect of imputation on relationships can be seen by considering 
its effect on the estimate of the sample covariance in the simple situation where the y-variable 
has missing responses that are missing at random over the population and the x-variable has 
no missing data. The sample covariance, $s_{xy}$, is calculated in the standard way, based on
the actual values for respondents and the imputed values for nonrespondents, as an estimate
of the population covariance $S_{xy}$. Using the fact that $E_2(\hat{y}_{mis}) = \hat{y}_{mid}$ as above, it can be
readily shown that the expected value of $s_{xy}$ under a deterministic imputation method is the
same as that under the corresponding stochastic method.

As Santos (1981) shows, the relative bias of $s_{xy}$ when the mean overall or random overall
imputation methods are used is approximately $-M$, where $M$ is the nonresponse rate. This
occurs because the imputed y-values are unrelated to their x-values, and hence the cases with
imputed values attenuate the covariance towards zero. This attenuation is decreased in
magnitude by imputation methods that use auxiliary variables. With class mean imputation
or random imputation within classes, the relative bias is approximately $-M(S_{xy \cdot z}/S_{xy})$,
where $S_{xy \cdot z} = \sum_h W_h S_{xyh}$ is the average within-class covariance for classes formed by the
auxiliary variables $z$, $S_{xyh}$ is the covariance within class $h$, and $W_h$ is the proportion of the
population in class $h$. With predicted regression imputation or regression imputation with
a random residual, both with a single auxiliary variable $z$, the relative bias is approximately
$-M[1 - (\rho_{xz}\rho_{yz}/\rho_{xy})]$, where $\rho_{uv}$ is the correlation between $u$ and $v$.
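
The attenuation under overall mean imputation can be verified exactly on a small data set. With r respondents among n records, the sample covariance of the completed data equals (r - 1)/(n - 1) times the covariance computed from the respondents alone, since every imputed record contributes zero to the sum of cross-products. The sketch below, with hypothetical data and function names, checks this identity; when the respondents' covariance is itself unbiased for the population covariance under the missing at random model, the identity corresponds to a relative bias of approximately minus the nonresponse rate.

```python
def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def overall_mean_impute(ys):
    """Replace each missing value (None) by the overall respondent mean."""
    observed = [y for y in ys if y is not None]
    fill = sum(observed) / len(observed)
    return [fill if y is None else y for y in ys]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, None, 5.0, None, 7.0, None, 10.0, None]     # nonresponse rate M = 0.5

resp = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
cov_respondents = sample_cov([p[0] for p in resp], [p[1] for p in resp])
cov_completed = sample_cov(x, overall_mean_impute(y))
ratio = cov_completed / cov_respondents               # (r - 1)/(n - 1) = 3/7
```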

The disturbing feature of these results is that, unless $M$ is small, $s_{xy}$ calculated with
imputed values under any of these imputation methods may be subject to substantial bias even
under the missing at random model. The estimates $s_{xy}$ computed with imputed values
obtained under the imputation class and regression methods are unbiased only if the partial
covariance $S_{xy \cdot z}$ is zero. In general, there is no reason to assume uncritically that $S_{xy \cdot z}$ is zero.
However, there is an important case when $S_{xy \cdot z} = 0$. This occurs when $x = z$, that is when
x is used as an auxiliary variable in the imputation procedure. In this case, the sample 
covariance is unbiased under the missing at random model. This result suggests that if the 
relationship between x and y is to form an important part of the survey analysis, x should 
be used as an auxiliary variable in imputing for missing y-values. 

The above theory assumes that only the y-variable was subject to missing data. In prac- 
tice the x-variable will often also be incomplete. If so, the sample covariance may be at- 
tenuated because of the imputations for both variables. A special feature occurs when x and 
y are both missing for a record. If the two values are imputed separately, the covariance 
is attenuated, but if they are imputed jointly, using the same respondent as the donor of 
both values, the covariance structure is retained. This suggests that when a record has several 
missing related values, they should be taken from the same donor. Coder (1978) describes 
the use of joint imputation from the same donor in the March Income Supplement of the 
Current Population Survey. 

As an illustration of how the above arguments about the attenuation of covariances app- 
ly to other forms of relationships, we will give a simple numerical example of the effect of 
imputation on the difference between two proportions. Let the variable of interest be whether 
an individual has a particular attribute or not, and suppose that one half of the respondents 
fail to answer this question. The missing responses are imputed by a random within class 
imputation method using two classes, A and B. The objective is now to compare the 
Table 1


Number of Respondents with the Attribute, and Number of
Sampled Persons by Class, Sex and Response Status


                                      Class A               Class B
                                   M     F   Total       M     F   Total

Respondents with the attribute    80    40    120       60    20     80
Total respondents                100   100    200      100   100    200
Nonrespondents                   100   100    200      100   100    200
Total sample                     200   200    400      200   200    400


percentages of men and women with the attribute. The data are displayed in Table 1. Since 
60% of the total respondents in class A have the attribute, 60 of the 100 male and 60 of 
the 100 female nonrespondents in that class will be imputed to have the attribute. Similarly, 
in class B 40% of the total respondents have the attribute, and so 40 male and 40 female 
nonrespondents will be imputed to have the attribute. The proportion of actual and imputed 
males with the attribute is thus (80 + 60 + 60 + 40)/400 = 0.6 or 60%. For females the 
corresponding proportion is (40 + 60 + 20 + 40)/400 = 0.4, or 40%. The difference bet- 
ween these two percentages is 20%. 

Had sex also been taken into account in forming the imputation classes, the percentages 
of males and females with the attribute would have been 70% and 30%, differing by 40%. 
The failure to include sex as an auxiliary variable in the imputation has thus caused a substan- 
tial attenuation in the measurement of the relationship between sex and having the attribute. 
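
The arithmetic of this example can be checked with a short script. The sketch below uses the counts of Table 1 and, like the text, works with the expected numbers of imputed cases; the function and variable names are ours and purely illustrative.

```python
from fractions import Fraction

# Table 1 counts: (respondents with attribute, total respondents, nonrespondents)
table = {
    ("A", "M"): (80, 100, 100),
    ("A", "F"): (40, 100, 100),
    ("B", "M"): (60, 100, 100),
    ("B", "F"): (20, 100, 100),
}

def proportion_with_attribute(sex, use_sex_in_classes):
    """Expected proportion (respondents plus imputed) with the attribute for one sex."""
    with_attr = Fraction(0)
    total = 0
    for (cls, s), (r_attr, r_tot, nonresp) in table.items():
        if s != sex:
            continue
        if use_sex_in_classes:
            # Imputation class = class by sex: donors share the recipient's sex.
            rate = Fraction(r_attr, r_tot)
        else:
            # Imputation class = class only: donors are pooled over both sexes.
            pooled_attr = sum(v[0] for (c, _), v in table.items() if c == cls)
            pooled_tot = sum(v[1] for (c, _), v in table.items() if c == cls)
            rate = Fraction(pooled_attr, pooled_tot)
        with_attr += r_attr + rate * nonresp
        total += r_tot + nonresp
    return with_attr / total

for by_sex in (False, True):
    p_m = proportion_with_attribute("M", by_sex)
    p_f = proportion_with_attribute("F", by_sex)
    print(by_sex, float(p_m), float(p_f), float(p_m - p_f))
```

With classes A and B only, the script gives 0.6 and 0.4 for males and females; with sex added to the imputation classes it gives 0.7 and 0.3, matching the figures quoted in the text.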


3.4 Multiple Imputations 


Ideally the analyst using a data set with imputed values should be able to obtain valid 
results for any analyses by applying standard techniques for complete data. However, as 
noted in the last section, imputation can distort measures of the relationships between 
variables. It also distorts standard error estimation. 

All imputation methods except deductive imputation fabricate data to some extent. The 
extent of fabrication depends on how well the imputation model predicts the missing values. 
If the imputation model explains only a small proportion of the variance in the variable among 
the respondents, the amount of fabrication in each imputed value is likely to be substantial. 
If the imputation model explains a high proportion of the respondent variance, the amount 
of fabrication is likely to be less serious. However, it needs to be recognized that the fit of 
the imputation model for the respondents is not necessarily a good measure of the fit for 
the nonrespondents. 

Standard errors computed in the standard way from a data set with imputed values will 
generally be underestimates because of the fabrication involved in the imputed values. Rubin 
(1978, 1979) has advocated the method of multiple imputations to provide valid inferences 
from data sets with imputed values (see also Herzog and Rubin 1983; Rubin and Schenker 
1986). When multiple imputations are used for the purpose of standard error estimation,
the construction of the complete data set by imputing for the missing responses is carried
out several (say m) times using the same imputation procedure. The sample estimates
$z_i$ ($i = 1, 2, \ldots, m$) of the population parameter of interest $Z$ are computed from each of
the replicate data sets, and their average $\bar{z}$ is calculated. A variance estimator for $\bar{z}$ is then
given by $V = W + [(m + 1)/m]B$, where $W$ is the average of the within-replicate variances
of the $z_i$ and $B = \sum (z_i - \bar{z})^2/(m - 1)$ is the between-replicate variance. Even with the
inclusion of the between-replicate variance component, however, the coverages of confidence
intervals for $Z$ based on $V$ are still overstated, with the amount of overstatement increasing
with the level of nonresponse.
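
The combination of replicate results described above can be sketched in a few lines. The function name and the numerical inputs below are hypothetical; only the formula for V is taken from the text.

```python
def combine_replicates(estimates, within_variances):
    """Combine m replicate estimates using V = W + ((m + 1) / m) * B."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need m >= 2 estimates and one variance per estimate")
    z_bar = sum(estimates) / m                      # average of the z_i
    w = sum(within_variances) / m                   # average within-replicate variance
    b = sum((z - z_bar) ** 2 for z in estimates) / (m - 1)  # between-replicate variance
    v = w + (m + 1) / m * b
    return z_bar, w, b, v

# Hypothetical results from m = 3 replicate data sets.
z_bar, w, b, v = combine_replicates([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(z_bar, w, b, v)
```

In this made-up example the between-replicate component adds (4/3) x 1 to the within-replicate variance of 4, so that a standard error based on W alone would be too small.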

This overstatement of the confidence levels can be addressed by modifying the imputa- 
tion procedure, as described by Rubin and Schenker (1986). Their treatment considers the 
random overall imputation method, and one of their modifications allows for uncertainty 
about the population mean and variance in the following way. With the standard random 
overall imputation method, the conditional expected mean and variance of the imputed values 
are the sample respondents’ mean and variance. With the modification, the expected mean 
and variance of the imputed values for a replicate are drawn at random from appropriate 
distributions. The imputed values are then a random selection of respondents’ values, modified 
for the randomly-chosen mean and variance. When estimating the population mean, the ef- 
fect of the changing expected mean and variance between replicates is to increase the between- 
replicate variance component in V. This increase gives improved coverage for the resultant 
confidence intervals. 

A major problem with the use of multiple imputations is the additional computer analysis 
needed, which increases as the number of replicates, m, increases. For this reason, a small 
value of m, such as m = 2, may be preferred. A small value of m may, however, result in 
a low level of precision for the variance estimator. Even with small m, it is questionable 
whether the multiple imputation approach is feasible for routine analyses. It may be best 
reserved for special studies, such as that described by Herzog and Rubin (1983). 

In addition to providing appropriate standard errors, another advantage of multiple im- 
putations from the same imputation procedure is that it reduces the loss of precision in survey 
estimates arising from the random selection of respondents to act as donors of imputed values 
(see Section 3.1). This loss is reduced with multiple imputations by averaging over the 
replicates. A small number of replicates serves well for this purpose. As noted earlier, Kalton 
and Kish (1984) describe alternative ways of selecting the sample of respondents to achieve 
this end. 

A second major potential application of multiple imputations is to generate the imputa- 
tions for the several replicates by different imputation procedures, making different assump- 
tions about the nonrespondents. Suppose, for instance, that hourly rates of pay are to be 
imputed for some earners in the sample. One procedure that might be used is the random 
within class imputation method, which is based on an assumption that nonrespondents are 
missing at random within the classes. If it is thought that the nonrespondents might in fact 
come more heavily from those with higher rates of pay in each class, a simple modification 
to the random within class method might be to impute values that are, say, 50 cents above
the donors’ values. Other imputation procedures - for instance, using different imputation 
classes - could also be tried. Comparison of the survey estimates obtained from the data 
sets in which the different imputation procedures are applied then provides a valuable in- 
dication of the sensitivity of the estimates to the values imputed. If the estimates turn out 
to be very similar, they can be accepted with greater confidence; if they differ markedly, 
the estimates need to be treated with considerable caution. 
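
A sensitivity comparison of this kind might be sketched as follows. The pay data, class labels, and function name are invented for illustration; the 50 cent shift is the one suggested in the text, and the same random donors are reused for both runs (by fixing the seed) so that the two estimates differ only through the shift.

```python
import random

def impute_within_class(records, shift=0.0, seed=0):
    """Random within-class imputation; imputed value = donor's value + shift."""
    rng = random.Random(seed)
    donors = {}
    for cls, pay in records:
        if pay is not None:
            donors.setdefault(cls, []).append(pay)
    completed = []
    for cls, pay in records:
        if pay is None:
            pay = rng.choice(donors[cls]) + shift
        completed.append(pay)
    return completed

# Hypothetical hourly pay data: (imputation class, pay or None if missing).
records = [("clerical", 8.0), ("clerical", 9.0), ("clerical", None),
           ("technical", 14.0), ("technical", 16.0), ("technical", None),
           ("technical", None), ("clerical", 10.0)]

base = impute_within_class(records, shift=0.0)
high = impute_within_class(records, shift=0.50)
mean_base = sum(base) / len(base)
mean_high = sum(high) / len(high)
print(f"mean, missing at random within class: {mean_base:.3f}")
print(f"mean, nonrespondents 50 cents higher: {mean_high:.3f}")
```

Here the two means differ by 0.50 times the proportion of imputed records; whether a difference of that size matters depends on the use to be made of the estimate.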


4. CONCLUDING REMARKS 


Weighting and imputation have been presented as two distinct methods for handling missing 
survey data, but in fact there is a close relationship between them. This may be illustrated 
by considering any imputation method that assigns respondents’ values to the nonrespondents.
For univariate analyses, this process is equivalent to dropping the nonrespondents’ records 
and adding the nonrespondents’ weights to those of the donor respondents (Kalton 1986). 
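
This equivalence for a univariate estimate can be verified numerically. In the sketch below the values, weights, and donor assignments are hypothetical, and the two function names are ours.

```python
def imputed_mean(values, weights, donor_of):
    """Weighted mean after each nonrespondent takes its donor's value."""
    completed = [values[donor_of[i]] if v is None else v
                 for i, v in enumerate(values)]
    return sum(w * y for w, y in zip(weights, completed)) / sum(weights)

def reweighted_mean(values, weights, donor_of):
    """Weighted mean of respondents after donors absorb nonrespondents' weights."""
    adjusted = list(weights)
    for i, v in enumerate(values):
        if v is None:
            adjusted[donor_of[i]] += weights[i]  # donor takes on the weight
            adjusted[i] = 0.0                    # nonrespondent record is dropped
    return (sum(w * y for w, y in zip(adjusted, values) if y is not None)
            / sum(adjusted))

values = [10.0, None, 30.0, None, 50.0]   # None marks a nonrespondent
weights = [1.0, 2.0, 1.5, 1.0, 2.5]       # survey weights
donor_of = {1: 0, 3: 4}                   # nonrespondent index -> donor index

print(imputed_mean(values, weights, donor_of))
print(reweighted_mean(values, weights, donor_of))
```

Both calls give the same weighted mean, as the argument in the text implies. The equivalence does not carry over to analyses involving several items imputed from different donors.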

The differences between weighting and imputation emerge when one considers the 
multivariate nature of survey data. It is possible to impute for the responses of a total 
nonrespondent by taking all the responses from a single donor; however, weighting is generally 
simpler in this case and it avoids the loss of precision arising from the sampling of respondents 
to serve as donors. It is not practicable to use weighting to handle item nonresponse since 
it would result in different sets of weights for each item; this would cause serious difficulties 
for crosstabulations and other analyses of the relationships between variables. 

Weighting is a single global adjustment that attempts to compensate for the missing 
responses to all the items simultaneously. Imputation, on the other hand, is item-specific. 
This difference has consequences for the way that the auxiliary data are used. In forming 
weighting classes, the focus is on determining classes that differ in their response rates. The 
choice of auxiliary variables to use in imputation, however, is primarily made in terms of 
their abilities to predict the missing responses. 

An assumption underlying all the procedures reviewed in this paper is that once the aux- 
iliary variables have been taken into account the missing values are missing at random. Thus, 
for instance, the nonrespondents are assumed to be like the respondents within weighting 
and imputation classes. This assumption can be avoided by using stochastic censoring models, 
as has been done by Greenlees et al. (1982) in imputing wages and salaries in the Current
Population Survey. However, as Little (1986b) observes, these models are highly sensitive 
to the distributional assumptions made. 

An alternative approach for handling missing survey data is to leave the values missing 
in the data set and let the analyst incorporate appropriate missing data models into the analysis 
(Little 1982). This approach has much to commend it, but the labor and computing time 
needed to implement it effectively preclude its use as a general purpose strategy. Rather, 
the approach seems best suited for a small range of special analyses. In order to permit the 
analyst to adopt this approach, it is essential that all imputed values be flagged to indicate 
they are not actual responses, so that they can then be dropped from the analysis. 

Finally, we should note that all methods of handling missing survey data must depend 
upon untestable assumptions. If the assumptions are seriously in error, the analyses may 
give misleading conclusions. The only secure safeguard against serious nonresponse bias in 
survey estimates is to keep the amount of missing data small. 


REFERENCES 


BAILAR III, J.C., and BAILAR, B.A. (1978). Comparison of two procedures for imputing missing 
survey values. Proceedings of the Section on Survey Research Methods, American Statistical Associa- 
tion, 462-467. 


BAILAR, B.A., and BAILAR III, J.C. (1983). Comparison of the biases of the hot-deck imputation 
procedure with an ‘‘equal-weights’’ imputation procedure. In Incomplete Data in Sample Surveys, 
Volume 3, Proceedings of the Symposium, (Eds. W.G. Madow and I. Olkin), New York: Academic 
Press, 299-311. 


BAILAR, B.A., BAILEY, L., and CORBY, C.A. (1978). A comparison of some adjustment and 
weighting procedures for survey data. In Survey Sampling and Measurement, (Ed. N.K. Namboodiri), 
New York: Academic Press, 175-198. 


BARTHOLOMEW, D.J. (1961). A method of allowing for ‘not at home’ bias in sample surveys. Ap- 
plied Statistics, 10, 52-59. 




BISHOP, Y.M.M., FIENBERG, S.E., and HOLLAND, P.W. (1975). Discrete Multivariate Analysis:
Theory and Practice. Cambridge, Mass: The MIT Press.


BROOKS, C.A., and BAILAR, B.A. (1978). An Error Profile: Employment as Measured by the Cur- 
rent Population Survey. Statistical Policy Working Paper 3. U.S. Department of Commerce. 
Washington, D.C.: U.S. Government Printing Office. 


CHAPMAN, D.W., BAILEY, L., and KASPRZYK, D. (1986). Nonresponse adjustment procedures 
at the U.S. Census Bureau. Survey Methodology, forthcoming. 


CODER, J. (1978). Income data collection and processing from the March Income Supplement to the 
Current Population Survey. The Survey of Income and Program Participation Proceedings of the 
Workshop on Data Processing, February 23-24, 1978, (Ed. D. Kasprzyk), Chapter II. Washington, 
D.C.: U.S. Department of Health, Education and Welfare. 


COLLEDGE, M.J., JOHNSON, J.H., PARE, R., and SANDE, I.G. (1978). Large scale imputation 
of survey data. Proceedings of the Section on Survey Research Methods, American Statistical Associa- 
tion, 431-436. 


COX, B.G., and COHEN, S.B. (1985). Methodological Issues for Health Care Surveys. New York: 
Marcel Dekker. 


DAVID, M., LITTLE, R.J.A., SAMUHEL, M.E., and TRIEST, R.K. (1986). Alternative methods 
for CPS income imputation. Journal of the American Statistical Association, 81, 29-41. 


DREW, J.H., and FULLER, W.A. (1980). Modelling nonresponse in surveys with callbacks. Pro- 
ceedings of the Section on Survey Research Methods, American Statistical Association, 639-642. 


DREW, J.H., and FULLER, W.A. (1981). Nonresponse in complex multiphase surveys. Proceedings 
of the Section on Survey Research Methods, American Statistical Association, 623-628. 


FORD, B.L. (1983). An overview of hot-deck procedures. In Incomplete Data in Sample Surveys, Volume
2, Theory and Bibliographies, (Eds. W.G. Madow, I. Olkin and D.B. Rubin), New York: Academic
Press, 185-207.


GREENLEES, W.S., REECE, J.S., and ZIESCHANG, K.D. (1982). Imputation of missing values 
when the probability of response depends on the variable being imputed. Journal of the American 
Statistical Association, 77, 251-261. 


HERZOG, T.N., and RUBIN, D.B. (1983). Using multiple imputation to handle nonresponse in sample
surveys. In Incomplete Data in Sample Surveys, Volume 2, Theory and Bibliographies, (Eds.
W.G. Madow, I. Olkin and D.B. Rubin), New York: Academic Press, 209-245.


KALTON, G. (1983). Compensating for Missing Survey Data. Ann Arbor: Survey Research Center, 
University of Michigan. 

KALTON, G. (1986). Handling wave nonresponse in panel surveys. Journal of Official Statistics, 2,
forthcoming.

KALTON, G., and KASPRZYK, D. (1982). Imputing for missing survey responses. Proceedings of 
the Section on Survey Research Methods, American Statistical Association, 22-31. 


KALTON, G., and KISH, L. (1984). Some efficient random imputation methods. Communications 
in Statistics - Theory and Methods, 13(16), 1919-1939. 


KISH, L. (1965). Survey Sampling. New York: Wiley. 


KISH, L. (1976). Optima and proxima in linear sample designs. Journal of the Royal Statistical Socie- 
ty, Ser. A, 139, 80-95. 


LITTLE, R.J.A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical 
Association, 77, 237-250. 


LITTLE, R.J.A. (1986a). Survey nonresponse adjustments for estimates of means. International
Statistical Review, 54, 139-157.


LITTLE, R.J.A. (1986b). Missing data in Census Bureau surveys. Proceedings of the Second Annual 
Census Bureau Research Conference, 442-454. 




LITTLE, R.J.A., and DAVID, M.H. (1983). Weighting adjustments for non-response in panel surveys. 
Working Paper, Washington, D.C.: U.S. Bureau of the Census. 


OH, H.L., and SCHEUREN, F. (1978a). Multivariate raking ratio estimation in the 1973 Exact Match 
Study. Proceedings of the Section on Survey Research Methods, American Statistical Association, 
716-722. 


OH, H.L., and SCHEUREN, F. (1978b). Some unresolved application issues in raking ratio estima- 
tion. Proceedings of the Section on Survey Research Methods, American Statistical Association, 
723-728. 


OH, H.L., and SCHEUREN, F. (1980). Estimating the variance impact of missing CPS income data. 
Proceedings of the Section on Survey Research Methods, American Statistical Association, 408-415. 


OH, H.L., and SCHEUREN, F. (1983). Weighting adjustment for unit nonresponse. In Incomplete
Data in Sample Surveys, Volume 2, Theory and Bibliographies, (Eds. W.G. Madow, I. Olkin and
D.B. Rubin), New York: Academic Press, 143-184.


OH, H.L., SCHEUREN, F., and NISSELSON, H. (1980). Differential bias impacts of alternative 
Census Bureau hot deck procedures for imputing missing CPS income data. Proceedings of the 
Section on Survey Research Methods, American Statistical Association, 416-420. 


PALMER, S. (1967). On the character and influence of nonresponse in the Current Population Survey. 
Proceedings of the Social Statistics Section, American Statistical Association, 73-80. 


PALMER, S., and JONES, C. (1966). A look at alternate imputation procedures for CPS noninter- 
views. Washington, D.C.: U.S. Bureau of the Census memorandum. 


POLITZ, A., and SIMMONS, W. (1949). I. An attempt to get the ‘not at homes’ into the sample 
without callbacks. II. Further theoretical considerations regarding the plan for eliminating callbacks. 
Journal of the American Statistical Association, 44, 9-31. 


POLITZ, A., and SIMMONS, W. (1950). Note on an attempt to get the ‘not at homes’ into the sam- 
ple without callbacks. Journal of the American Statistical Association, 45, 136-137. 


RUBIN, D.B. (1978). Multiple imputations in sample surveys: a phenomenological Bayesian approach 
to nonresponse. Proceedings of the Section on Survey Research Methods, American Statistical 
Association, 20-34. 


RUBIN, D.B. (1979). Illustrating the use of multiple imputations to handle nonresponse in sample 
surveys. Bulletin of the International Statistical Institute, 48(2), 517-532. 


RUBIN, D.B., and SCHENKER, N. (1986). Multiple imputation for interval estimation from simple 
random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 
366-374. 


SANDE, G. (1979). Numerical edit and imputation. Paper presented to the International Association 
for Statistical Computing, 42nd Session of the International Statistical Institute. 


SANDE, I.G. (1983). Hot-deck imputation procedures. In Incomplete Data in Sample Surveys, Volume 
3, Proceedings of the Symposium, (Eds. W.G. Madow and I. Olkin), New York: Academic Press, 
339-349. 


SANTOS, R.L. (1981). Effects of imputation on regression coefficients. Proceedings of the Section 
on Survey Research Methods, American Statistical Association, 140-145. 


THOMSEN, I. (1973). A note on the efficiency of weighting subclass means to reduce the effects of 
nonresponse when analyzing survey data. Statistisk Tidskrift, 4, 278-283. 


THOMSEN, I., and SIRING, E. (1983). On the causes and effects of nonresponse: Norwegian
experiences. In Incomplete Data in Sample Surveys, Volume 3, Proceedings of the Symposium, (Eds.
W.G. Madow and I. Olkin), New York: Academic Press, 25-29.


VACEK, P.M., and ASHIKAGA, T. (1980). An examination of the nearest neighbor rule for imputing 
missing values. Proceedings of the Statistical Computing Section, American Statistical Association, 
326-331. 


WELNIAK, E.J., and CODER, J.F. (1980). A measure of the bias in the March CPS earnings im- 
putation system. Proceedings of the Section on Survey Research Methods, American Statistical 
Association, 421-425. 


Survey Methodology, June 1986 17 
Vol. 12, No. 1, pp. 17-27 
Statistics Canada 


On the Definitions of Response Rates 
ABSTRACT


In this paper, different types of response/nonresponse and associated measures such as rates are pro- 
vided and discussed together with their implications on both estimation and administrative procedures. 
The missing data problems lead to inconsistent terminology related to nonresponse such as completion 
rates, eligibility rates, contact rates, and refusal rates, many of which can be defined in different ways. 
In addition, there are item nonresponse rates as well as characteristic response rates. Depending on the 
uses, the rates may be weighted or unweighted. 


KEY WORDS: Eligibility; Completion; Contact; Refusal; Response Rates. 


1. INTRODUCTION 


The census or sample survey data are gathered by any one of such procedures as personal 
interview, telephone, or mail. It sometimes happens that some units may not respond for such 
reasons as “not at home”, “away on vacation”, “units closed”, “respondent refusal”, “unit 
vacant” or “demolished”, etc. Other units may respond only partially, e.g. some but not all 
persons within a dwelling may respond or the units may respond to some but not all ques- 
tions. Furthermore, units may respond to questions but provide incorrect or inaccurate 
responses. 

Thus, any survey, whatever its type and method of data collection, will suffer from miss- 
ing data due to nonresponse. Nonresponse has been generally recognized as an important 
measure of the quality of data since it affects the estimates by introducing a possible bias 
in the estimates and an increase in sampling variance because of the reduced sample. The 
relationship between sampling variance and the nonresponse rate is fairly straightforward. 
However, the relationship between the bias and the size of nonresponse while perhaps more 
important is less obvious since it depends on both the magnitude of nonresponse and the 
differences in the characteristics between respondents and nonrespondents. One can speculate 
that the nonresponse bias is proportional to the nonresponse rate. For a given response rate, 
the percentage bias would then be independent of sample size. However, the sampling variance 
is affected by the sample size and is inversely proportional to the responding sample size. 
Thus, the nonresponse bias may not be nearly so serious relative to the sampling errors for 
small samples as it is for large samples. The apparent confidence interval may cover the true 
value in the case of small samples but may not in the case of large samples in the presence 
of nonresponse bias. If we measure the “seriousness” of the nonresponse bias by the ratio 
of the nonresponse bias to the coefficient of sampling variation, then the “seriousness” of 
the nonresponse bias is proportional to the square root of the responding sample size times 
the nonresponse rate. 
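
The proportionality just stated can be made concrete with a small calculation. The sketch below rests on assumptions that are ours rather than the paper's: the bias of the respondent mean is taken as the nonresponse rate times the difference between respondent and nonrespondent means, and the standard error is that of a simple random sample without finite population correction. The function name and the input values are hypothetical.

```python
import math

def seriousness(nonresponse_rate, mean_diff, std_dev, n_responding):
    """Ratio of nonresponse bias to the standard error of the respondent mean."""
    bias = nonresponse_rate * mean_diff            # assumed bias of the respondent mean
    std_error = std_dev / math.sqrt(n_responding)  # simple random sampling, no fpc
    return bias / std_error

# Same nonresponse rate and population, increasing responding sample size.
for n_r in (100, 400, 1600):
    print(n_r, round(seriousness(0.2, 5.0, 25.0, n_r), 2))
```

Quadrupling the responding sample size doubles the ratio, which is why a bias that is negligible beside the sampling error of a small sample can dominate the error of a large one.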

In a more practical way, the size of response/nonresponse may indicate the operational 
problems and provide an insight into the reliability of survey data. However, different types 
of response/nonresponse rates are used for these two purposes, depending upon whether
or not a contact has been made with a designated unit. One can therefore distinguish between
‘‘contact’’ and ‘‘no contact’’ types. One type such as ‘‘no one at home’’ or ‘‘temporarily
absent’’ is in fact a ‘‘no contact’’ problem and is primarily operationally oriented.
The other type is the true nonresponse problem, where contact has been made with the selected 
unit but no response or acceptable response is obtained. 

In an interview process itself an interviewer may find units in the sample that should not 
be there (ineligible for the sample). Also, there will be units with questionnaires only
partially completed as well as units with all questionnaires completed. Each of these events may
be defined as a rate, i.e. eligibility rate, item response rate, completion rate, etc. The distinc- 
tion between the ‘‘true’’ nonresponse and other causes affecting the total size of nonresponse 
rate may give rise to different interpretations. 

The interpretation of response/nonresponse rates is particularly difficult when one deals 
with complex survey designs since the concentration of nonresponse may be higher in one 
area or class than in another. Still, response rates have been used as proxies for data quality 
by almost all survey statisticians. That is why the interest in collecting data on nonresponse 
and the evaluation of it has usually been part of survey taking. However, only the measures 
of bias, variance, and the resultant mean square error from all sources of sampling and non- 
sampling errors can provide an informed basis for evaluating survey results. 

Recently, nonresponse has been increasing in many surveys in Canada and elsewhere. Con- 
sequently, there is a greater need than ever before to monitor nonresponse rates, to make 
comparisons between surveys, countries, survey organizations, and to ensure some degree 
of comparability. There have been attempts to standardize the definition of response rate 
and its complement, the nonresponse rate; see for example, Kviz (1977), Cannell (1978). Pro- 
blems of inconsistent definitions of response rates related to telephone surveys are described 
by Wiseman and McDonald (1980). 

There are also problems of inconsistent terminology with regard to response/nonresponse 
in surveys. Terms such as completion rate, contact rate, and under-coverage rate have been 
used in different contexts in reports and articles dealing with data collection. While these 
terms may be readily distinguished in an individual report, they may be confusing and sub- 
ject to conflicting interpretations, when studying different reports. 

To consider response/nonresponse problems, a distinction must be made between unit 
and item nonresponse rates. Unit nonresponse rates generally pertain to the level at which 
survey data are gathered during the first contact. Examples of the level could be a dwelling, 
individual, store or establishment. However, in the case of multi-stage sampling, there may 
be nonresponse of all units within clusters or even primary sampling units (psu) so that unit 
nonresponse could apply to a selected cluster or psu as well as a dwelling or individual. 


Item nonresponse usually pertains to the questionnaires, where information has been
provided for some questions but not for all that should have been answered. However, if a unit
fails to respond, it automatically fails to respond to any item. Hence, unit nonresponse and
item nonresponse are distinct events that should be dealt with separately.

The response rates may pertain to the whole sample or to part of a sample, such as
design-dependent areas, or they may apply to administrative areas such as an interviewer
assignment or a group of assignments overseen by a supervisor or field office.


2. RESPONSE/NONRESPONSE COMPONENTS 


In order to define various response rates and discuss their uses and applications, it is necessary 
to split up the target population for the sample or census into the various components, by 
type of response/nonresponse. Table 1 accomplishes this very purpose, indicating most of the 
important components of the whole survey that will be used in the rates. Once a target popula- 
tion (Box 1) is defined for a survey, a survey frame of N units (Box 2) is then determined. 
Table 1

Response/Nonresponse Components

(1)  Target population
(2)  Sample/census frame: N units
(3)  Survey data gathering procedure: personal, telephone, mail, or combination
(4)  Sample selection or census: $n \le N$ units; $n = \sum t_i$
     (5)  Ineligible units: $\sum t_i (1 - e_i)$
          (6)  Correctly not enumerated: $\sum t_i (1 - e_i)(1 - \delta_i)$
          (7)  Enumerated, but should not have been: $\sum t_i (1 - e_i) \delta_i$
     (8)  Eligible units: $\sum t_i e_i$
          (9)  Unit respondents (full item response): $\sum t_i e_i \delta_i \prod_y \delta_{iy}$
          (10) Unit respondents (some item nonresponse): $\sum t_i e_i \delta_i (1 - \prod_y \delta_{iy})$
               Unit respondents of Boxes 9 and 10, classified by response to item y:
               (12) Item y respondents: $\sum t_i e_i \delta_i \delta_{iy}$
                    (15)  Without response error for item y
                    (16A) Detected response error for item y
                    (16B) Undetected response error for item y
               (13) Item y nonrespondents: $\sum t_i e_i \delta_i (1 - \delta_{iy})$
                    (14) Refusals for item y
                         Other than refusals for item y
          (11) Unit nonrespondents: $\sum t_i e_i (1 - \delta_i)$
               (18) Unit nonresponse (refusal): $\sum t_i e_i (1 - \delta_i) r_i$
               (19) Unit nonresponse (other than refusal): $\sum t_i e_i (1 - \delta_i)(1 - r_i)$

$t_i$ = 1, 0 (selected/not selected)
$e_i$ = 1, 0 (unit eligible/ineligible)
$\delta_i$ = 1, 0 (unit response/nonresponse)
$\delta_{iy}$ = 1, 0 (item y response/nonresponse)
$r_i$ = 1, 0 according as unit refused or not

For $r_i$ = 0, mainly ‘‘Not at Home’’ or ‘‘Temporarily Absent’’

It should be mentioned that as a result of possible under- and over-coverage of units the
frame may not correspond exactly to the target population. Since under-coverage is usually
more prevalent than over-coverage in practice, the actual target population usually contains
more than N units.

For the survey to be taken, a data gathering procedure (Box 3) and an appropriate design
are decided upon, and by sample or census $n = \sum t_i$ units are selected, where:


$t_i$ = 1 or 0 according as unit $i$ is selected or not,


$\sum$ = summation over all N units in the survey frame.


Often, in a sample frame, N may not be precisely known but rather can only be estimated 
from the sample. This is often the case in multi-stage probability samples with area sampling 
at earlier stages of selection. 

Out of the sample of n units, $\sum t_i e_i$ are eligible (Box 8) and $\sum t_i (1 - e_i)$ are ineligible (Box
5) for the survey, where


$e_i$ = 1 or 0 according as unit $i$ is eligible or not.


Sometimes the eligibility criterion may not be determined if the unit cannot be contacted 
while at other times the eligibility criterion is obvious from the physical appearance, such 
as vacant/non-vacant dwellings in a household survey. 

The $\sum t_i (1 - e_i)$ ineligible units (Box 5) may be split up between $\sum t_i (1 - e_i)(1 - \delta_i)$
units not interviewed, just as they should not have been (Box 6), and $\sum t_i (1 - e_i) \delta_i$ units
incorrectly interviewed (Box 7). One hopes that the number of such units in Box 7 is zero
or at least very small. However, if such units are discovered, they should be deleted
from the sample. In the above and in the breakdowns that follow, $\delta_i$ = 1 or 0 according
as unit $i$ responded or did not respond.

The $\sum t_i e_i$ eligible units (Box 8) may be split up between $\sum t_i e_i \delta_i$ unit respondents (Box
9 + Box 10) and $\sum t_i e_i (1 - \delta_i)$ unit nonrespondents (Box 11). The latter provided no usable
survey data, and little, if anything, is known about them, except perhaps their geographic
location.

The $\sum t_i e_i \delta_i$ unit respondents may be split up first between $\sum t_i e_i \delta_i \prod_y \delta_{iy}$ units free
of item nonresponse, but with possible response errors (Box 9), and $\sum t_i e_i \delta_i [1 - \prod_y \delta_{iy}]$ units
with item nonresponse in at least one characteristic but not in all characteristics (Box 10).
Here $\delta_{iy}$ = 1 or 0 according as responding unit $i$ responds or does not respond to item or
characteristic $y$, and the product is taken over all items. In Box 9, $\delta_{iy} = 1$ for unit $i$ and
all items, while in Box 10, $\delta_{iy} = 0$ for one or more items but not for all of them. For a
particular item $y$, some of the $\sum t_i e_i \delta_i \delta_{iy}$ item $y$ respondents (Box 12) come from those
unit respondents free of item nonresponse (Box 9), while the remainder come from those unit
respondents with some item nonresponse among one or more items other than item $y$ (Box 10).
The $\sum t_i e_i \delta_i (1 - \delta_{iy})$ item $y$ nonrespondents (Box 13) come from those unit respondents
with some item nonresponse (Box 10) that include item $y$.

The item y respondents of (Box 12) may be decomposed into three components, (i) those 
units with item y free of response error, (ii) those with a detected response error for item 
y, and (iii) those with an undetected response error for item y, in Boxes 15, 16A, and 16B 
respectively. 
The $\sum t_i e_i \delta_i (1 - \delta_{iy})$ item $y$ nonrespondents (Box 13) all come from the unit respondents,
i.e. $\delta_i = 1$, $\delta_{iy} = 0$. These item nonrespondents may be decomposed into two components,
viz., (i) those who refused to reply to question $y$ or those who terminated the interview prior
to item $y$ (Box 14) and (ii) those who failed to supply data for item $y$ because of
misunderstanding by either the respondent or interviewer or because of other reasons such
as failure to follow the proper path in the questionnaire.

Finally, the unit nonrespondents (Box 11) may be split up among refusals (Box 18) and
other than refusals (Box 19), the latter being mainly non-contacts with reasons such as ‘‘not at
home’’ or ‘‘temporarily absent’’. Here, $r_i = 1$ for refusal and $r_i = 0$ for cases of
‘‘other than refusal’’.
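
The box counts defined above are simply sums of products of 0/1 indicators, and they can be tallied mechanically once each frame unit carries its indicators. The sketch below is illustrative only: the record layout, the field names (with `d` standing for the unit response indicator), the item names, and the six example units are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    t: int                 # 1 if selected, else 0
    e: int = 0             # 1 if eligible, else 0
    d: int = 0             # unit response indicator: 1 if the unit responded
    r: int = 0             # 1 if a unit nonrespondent refused, else 0
    items: dict = field(default_factory=dict)  # item response indicator per item y

def box_counts(frame, item):
    """Counts for the boxes of Table 1 that depend only on the indicators."""
    def total(f):
        return sum(f(u) for u in frame)
    def full(u):  # 1 if the unit responded to every item, else 0
        return int(all(u.items.values()))
    def resp(u):  # item response indicator for the chosen item
        return u.items.get(item, 0)
    return {
        4: total(lambda u: u.t),
        5: total(lambda u: u.t * (1 - u.e)),
        6: total(lambda u: u.t * (1 - u.e) * (1 - u.d)),
        7: total(lambda u: u.t * (1 - u.e) * u.d),
        8: total(lambda u: u.t * u.e),
        9: total(lambda u: u.t * u.e * u.d * full(u)),
        10: total(lambda u: u.t * u.e * u.d * (1 - full(u))),
        11: total(lambda u: u.t * u.e * (1 - u.d)),
        12: total(lambda u: u.t * u.e * u.d * resp(u)),
        13: total(lambda u: u.t * u.e * u.d * (1 - resp(u))),
        18: total(lambda u: u.t * u.e * (1 - u.d) * u.r),
        19: total(lambda u: u.t * u.e * (1 - u.d) * (1 - u.r)),
    }

frame = [
    Unit(t=1, e=1, d=1, items={"income": 1, "age": 1}),  # full item response
    Unit(t=1, e=1, d=1, items={"income": 0, "age": 1}),  # item nonresponse on income
    Unit(t=1, e=1, d=0, r=1),                            # unit refusal
    Unit(t=1, e=1, d=0, r=0),                            # not at home
    Unit(t=1, e=0, d=0),                                 # ineligible, not enumerated
    Unit(t=0),                                           # not selected
]
counts = box_counts(frame, "income")
print(counts)
```

Keeping the indicators at the unit level, rather than only the totals, lets the same records yield both unweighted counts and, with the weights attached, weighted versions of the rates.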

In order to count the respondents and nonrespondents according to type and reason, careful 
records must be kept of every sampled unit. This is essential if a probability sample is not 
to deteriorate into a quota sample, for example, because of ad hoc treatment of nonresponse, 
such as arbitrary substitution of other units for the nonrespondents. In the case of quota 
samples, it is sometimes difficult or impossible to distinguish substituted units from originally 
selected units when survey takers try to reach the quota with easy-to-obtain survey data from 
co-operative respondents rather than attempt call-backs of nonrespondents. 

Even in probability samples with units carefully labelled and monitored according to plan, 
it is sometimes difficult to determine precisely the reason for nonresponse among the units 
that failed to be contacted. The problem is usually most straightforward in the case of per- 
sonal interviews. However, even in that case, it may be difficult to distinguish ‘‘no one at 
home’’ from ‘‘temporarily absent’’ or ‘‘refusals’’ from ‘‘non-contacts’’ when persons are 
obviously at home but refuse to answer the door. In the case of telephone interviews, ‘‘no 
answer’’ or ‘‘busy signal’’ reveals nothing about the lack of contact of the selected unit 
although ‘‘refusals’’ of contacted units by telephone may be evident. In the case of mail 
surveys, when the mail is not returned, the reason could be ‘‘refusal’’ just as easily as ‘‘tem- 
porarily absent’’. The ‘‘not at home (unit)’’ in the usual context of nonresponse studies as 
distinguished from ‘‘away from home (unit)’’ does not apply to mail surveys. In mail surveys, 
the reason for nonresponse usually must be determined by personal or telephone follow-up 
of the unit, often by sub-sampling nonrespondents, some of which may become respondents 
while others may remain nonrespondents for reasons that may be determined. 

The eligibility of selected units is usually evident in the case of personal interviews although 
failure to contact the units may result in an interviewer’s inability to screen out undesirable 
types of units for a particular survey. No phone answers or busy signals may result in a com- 
plete failure to determine either the eligibility or type of nonresponse of the unit. Discon- 
nected telephone numbers or ineligible telephone respondents in a screening survey will provide 
some measures of ineligibility in a telephone survey. In the case of mail surveys, some returned
mail or non-existent addresses among selected units may yield clues about some types of
ineligibility while other types may be discernible only by means of personal or telephone
follow-up.


3. DEFINITIONS OF VARIOUS RATES 


The sample of $n = \sum_i t_i$ units, decomposed in Table 1 of Section 2 into eligible units, unit
respondents/nonrespondents, refusals, item respondents/nonrespondents, etc., leads to many
different types of rates, which are defined below. For each rate, the numerator is a particular
subset of the denominator. Wherever possible, the rate is defined in terms of the counts of
units as broken down in Table 1.

(a) Eligibility Rate

The eligibility rate is given by:

$$ \bar{e} = \frac{\sum_i t_i e_i}{\sum_i t_i} = \frac{\text{(Box 8)}}{\text{(Box 4)}}. \qquad (3.1) $$


Wiseman and McDonald (1980) used the term ‘‘incidence rate’’ but applied the term only
to selected persons of telephone samples that actually answered (responded) at the screening 
phase to determine their eligibility for the survey. 

The eligibility rate, as in (3.1), demonstrates the quality of the survey design in selecting 
eligible units from a frame, where the eligibility may not be readily determinable without 
some cursory contact or observation. The rate provides, at the screening stage, information 
to determine how many eligible units will result at the survey data gathering stage. Thus, 
the rate may be employed at the design stage if data on eligibility are available from earlier 
studies. Depending upon the nature and procedure of the survey, the eligibility of units may
not be determinable among non-contacts or even refusals. There are two alternative ways of
treating such units in the definitions of the eligibility rate and of the response rates (defined
later) pertaining to eligible units. One can assume, for conservative estimates of data quality
and of the quality of the procedure for gathering survey data, that all non-contacts and refusals
are eligible, even though realistically the proportion of eligible units among such nonrespondents
is often lower than among respondents and nonrespondents for which eligibility is known.
Under this assumption a lower bound for the response rate and an upper bound for the
eligibility rate are obtained. Alternatively, one can assume the same proportion of eligible
units among units whose eligibility cannot be determined as among those whose eligibility
is known. Under that assumption we would likely have a slight over-estimate of the
eligibility rate and of some of the other rates.
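
The two assumptions can be compared numerically. The sketch below is illustrative only: the function name and the counts are hypothetical and are not taken from Table 1. Units of undetermined eligibility are treated first as all eligible, then as eligible in the same proportion as units whose eligibility is known.

```python
def eligibility_rates(known_eligible, known_ineligible, undetermined):
    """Return (upper_bound, proportional) eligibility rates for a sample.

    Hypothetical helper: the arguments are counts of sampled units whose
    eligibility is known to be yes, known to be no, or could not be determined.
    """
    n = known_eligible + known_ineligible + undetermined
    # Assumption 1: every undetermined unit is eligible (upper bound).
    upper_bound = (known_eligible + undetermined) / n
    # Assumption 2: undetermined units are eligible at the known-unit rate.
    known_share = known_eligible / (known_eligible + known_ineligible)
    proportional = (known_eligible + known_share * undetermined) / n
    return upper_bound, proportional


# Made-up counts: 800 eligible, 100 ineligible, 100 undetermined.
upper_bound, proportional = eligibility_rates(800, 100, 100)
```

Under the first assumption the eligible denominator is as large as it can be, so a response rate computed on that denominator is at its lower bound.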


(b) Response and Completion Rates 


(i) According to one of two alternative definitions provided by the U.S. Federal Committee
on Statistical Methodology (1978), the response rate is the percentage of the eligible
sample for which information (survey data) is obtained. Thus the response rate is defined
as:

$$ R_{(1)} = \frac{\sum_i t_i e_i \delta_i}{\sum_i t_i e_i} = \frac{\text{(Box 9)} + \text{(Box 10)}}{\text{(Box 8)}}. \qquad (3.2) $$


The above is the most commonly employed response rate in practice, as it yields the percentage
of the sample for which some useful survey data are obtained once the ineligible units
are deleted. All types of nonrespondents among eligible units are included in the denominator.

The inverse of the above rate within an adjustment cell is frequently used as a weight
adjustment to compensate for the missing data of nonresponding units; for example, such rates
are frequently used in the Canadian LFS for weight adjustments (see Platek and Gray 1985).
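
As a minimal sketch of how (3.2) and its inverse might be computed within one adjustment cell, assume each frame unit carries three 0/1 indicators: t (selected in the sample), e (eligible) and d (unit respondent, standing for the response indicator). The function name and the data are hypothetical.

```python
def response_rate(t, e, d):
    """Responding eligible sampled units divided by eligible sampled units."""
    responding = sum(ti * ei * di for ti, ei, di in zip(t, e, d))
    eligible = sum(ti * ei for ti, ei in zip(t, e))
    return responding / eligible


# Hypothetical adjustment cell with six frame units.
t = [1, 1, 1, 1, 1, 0]   # sample selection indicators
e = [1, 1, 1, 1, 0, 1]   # eligibility indicators
d = [1, 1, 1, 0, 0, 0]   # unit response indicators

rate = response_rate(t, e, d)      # 3 respondents out of 4 eligible sampled units
weight_adjustment = 1.0 / rate     # factor applied to respondent weights in the cell
```

Multiplying each respondent's weight by `weight_adjustment` inflates the respondents to represent all eligible sampled units in the cell.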

The above rate or its complement, the nonresponse rate, is frequently used for administrative
and operational assessments of survey organizations. The rates are also used
to assess interviewers’ ability to contact respondents and to elicit their co-operation in
providing usable survey data, e.g., response/nonresponse rates by interviewer assignment. The
nonresponse rate includes both refusals, which may be controlled by good public relations and
diplomacy, and non-contacts, which may be beyond the control of the interviewer. Hence,
wherever possible, the nonresponse rates are split up by reason. The overall
response rate in the LFS is about 95% in most months. Out of the 5% nonresponse, about 1%
are refusals.

A similar rate to the above was defined as a completion rate by Kviz (1977), who included
the whole sample in the denominator. Such a rate may provide a more conservative estimate
of quality than (3.2) in that ineligible units such as vacants are included in the denominator.
For example, in the LFS, the completion rate by Kviz’s definition would drop from 95%
according to (3.2) to about 85%.
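
The relation between the two rates can be made explicit. As a hedged illustration, write $C_K$ (a symbol introduced here, not used in the original) for the completion rate with the whole sample in the denominator, with $t_i$ the sample selection indicator, $e_i$ the eligibility indicator and $\delta_i$ the unit response indicator:

```latex
C_K \;=\; \frac{\sum_i t_i e_i \delta_i}{\sum_i t_i}
    \;=\; \frac{\sum_i t_i e_i \delta_i}{\sum_i t_i e_i}
          \cdot \frac{\sum_i t_i e_i}{\sum_i t_i}
```

That is, the completion rate is the response rate of (3.2) multiplied by the eligibility rate of (3.1). With the approximate LFS figures quoted above, an eligibility rate near 0.85/0.95, or about 0.89, would account for the drop from 95% to 85%. This implied figure is an inference from the two quoted rates, not a value stated in the source.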


(ii) Another definition by the above-mentioned committee is the percentage of times an
interviewer obtains interviews at sample addresses where contacts are made, given by:

$$ R_{(2)} = \frac{\sum_i t_i \delta_i}{\sum_i t_i [\delta_i + (1 - \delta_i) r_i]} \qquad (3.3) $$

where unit i refused or did not refuse according as $r_i = 1$ or 0 respectively. The above was
defined as a completion rate by O’Neill, Groves, and Cannell (1979). If, as in (3.3), the eligibility
of all units that are contacted can be determined, then another and perhaps superior (known
or estimated) definition of the above rate pertaining to eligible units can be given by

$$ R_{(3)} = \frac{\sum_i t_i e_i \delta_i}{\sum_i t_i e_i [\delta_i + (1 - \delta_i) r_i]}
           = \frac{\text{(Box 9)} + \text{(Box 10)}}{\text{(Box 9)} + \text{(Box 10)} + \text{(Box 18)}} \qquad (3.4) $$

where $e_i$, the eligibility criterion, is defined after Table 1.

The above rates (3.3) and (3.4) may be useful in personal and telephone surveys where
nonrespondents may include non-contacts and refusals. The rates are not practical in mail
surveys unless there is a telephone or personal follow-up of nonrespondents, since in most
pure mail surveys the survey organization is faced with either response or nonresponse
for unknown reasons. Where the above rates are useful, however, they measure the
ability of a data collection method to elicit the co-operation of responsible respondents at
selected units, given that they are contacted. The non-contacts, which may be beyond the
control of interviewers in some survey procedures, are removed from the rates entirely.

The response rate in (3.4) was also defined as a completion rate by Klecka and Tuchfarber
(1979), who assumed, perhaps unrealistically, that all refusals were eligible for the survey.
The completion rate would then have been a conservative estimate of the performance
of the data collection method in eliciting the co-operation of eligible units. Alternatively,
one may assume the proportion of eligible units among refusals to be the same
as among completed interviews and other units whose eligibility is known.


(c) Contact Rates 


A ‘‘contact rate’’, defined by Hauck (1974), is the percentage of sample units that are
contacted:

$$ R_{(4)} = \frac{\text{Completed interviews} + \text{Refusals (contacted)}}
                 {\text{Completed interviews} + \text{Refusals (contacted)} + \text{Noncontacts}} $$

where the ‘‘Noncontacts’’ were assumed to be eligible, for a conservative estimate of the
success in contacting sampled units. The ‘‘Refusals’’ may include ‘‘Terminations’’ or ‘‘Incomplete
Interviews’’ that are essentially ‘‘Refusals’’ for some items as in (Box 10) of Table 1.

The algebraic expression for the contact rate is given by:

$$ R_{(4)} = \frac{\sum_i t_i \delta_i e_i + \sum_i t_i (1 - \delta_i) r_i \hat{e}_i}
                 {\sum_i t_i \delta_i e_i + \sum_i t_i (1 - \delta_i) r_i \hat{e}_i
                  + \sum_i t_i (1 - \delta_i)(1 - r_i) \hat{e}_i}
           = \frac{\text{(Box 9)} + \text{(Box 10)} + \text{(Box 18)}}
                  {\text{(Box 9)} + \text{(Box 10)} + \text{(Box 18)} + \text{(Box 19)}}, \qquad (3.5) $$

where $\hat{e}_i = e_i = 1$ or 0 if the eligibility criterion is known and, for non-contacts,
$\hat{e}_i = 1$ according to Hauck’s definition, or $\hat{e}_i = \bar{e}$, the average eligibility
rate among those units whose eligibility criteria are known.


The contact rate measures the ability of the survey organization or interviewers to contact 
respondents whether or not they succeeded in eliciting their co-operation. In the LFS, the 
contact rate among non-vacant dwellings is around 96% each month. 
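
The effect of the two choices of eligibility for non-contacts on the contact rate can be sketched from box counts. This is an illustration under stated assumptions: the function name and the counts are hypothetical, and refusals are taken to be eligible units.

```python
def contact_rate(completed, refusals, noncontacts, noncontact_eligibility=1.0):
    """Contact rate in the spirit of (3.5), computed from counts of units.

    noncontact_eligibility is the assumed eligible share among non-contacts:
    1.0 for the conservative choice, or an average eligibility rate estimated
    from units of known eligibility. Refusals are assumed here to be eligible.
    """
    contacted = completed + refusals
    return contacted / (contacted + noncontact_eligibility * noncontacts)


# Made-up counts for illustration only.
conservative = contact_rate(900, 20, 80)        # all non-contacts counted as eligible
discounted = contact_rate(900, 20, 80, 0.5)     # half of the non-contacts counted
```

Counting every non-contact as eligible gives the smaller, more conservative contact rate.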


(d) Refusal Rate (Non-refusal Rate) 


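Two definitions of refusal rates are given by Hauck (1974) and Wiseman and McDonald
(1980) respectively as:

$$ F_1 = \frac{\text{number of refusals}}{\text{number of completed interviews and refusals}}
       = \frac{\sum_i t_i e_i (1 - \delta_i) r_i}{\sum_i t_i e_i [\delta_i + (1 - \delta_i) r_i]} \qquad (3.6) $$

$$ \phantom{F_1} = \frac{\text{(Box 18)}}{\text{(Box 9)} + \text{(Box 10)} + \text{(Box 18)}} = 1 - R_{(3)}, $$

$$ F_2 = \frac{\text{number of refusals}}{\text{number of all selected units}}
       = \frac{\sum_i t_i (1 - \delta_i) r_i}{\sum_i t_i}
       = \frac{\text{(Box 18)}}{\text{(Box 4)}}. \qquad (3.7) $$

With the eligibility criteria taken into account, the refusal rate in (3.7) may be given by:

$$ F_3 = \frac{\sum_i t_i (1 - \delta_i) r_i \hat{e}_i}{\sum_i t_i \hat{e}_i}
       = \frac{\text{(Box 18)}}{\text{(Box 8)}}, \qquad (3.8) $$

where $\hat{e}_i$ is defined after (3.5).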

The refusal rate measures the extent of the inability of the survey organization or the
interviewer to elicit the co-operation of units to provide usable survey data, relative to all
contacted units (3.6), relative to the whole sample (3.7) or relative to the eligible sample (3.8).
In (3.6), one may wish to determine a ‘‘pure’’ refusal rate without non-contacts, which are
often beyond the interviewers’ control, in order to study the efficiency of a questionnaire
or the effect of the survey topic on the co-operation of contacted units. Alternatively, in (3.7)
and (3.8), one may prefer to examine the refusal rate as one of several components of
overall nonresponse.
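
The three refusal rates differ only in their denominators, which a short sketch in terms of box counts makes visible. The function and the counts below are hypothetical and serve only to illustrate the box-count forms of (3.6) to (3.8).

```python
def refusal_rates(sample_size, eligible, respondents, refusals):
    """Return refusal rates relative to contacted units, to the whole sample
    and to the eligible sample, in that order."""
    per_contacted = refusals / (respondents + refusals)   # (Box 18)/[(Box 9)+(Box 10)+(Box 18)]
    per_sample = refusals / sample_size                    # (Box 18)/(Box 4)
    per_eligible = refusals / eligible                     # (Box 18)/(Box 8)
    return per_contacted, per_sample, per_eligible


# Made-up counts for illustration only.
f1, f2, f3 = refusal_rates(sample_size=1000, eligible=900, respondents=850, refusals=10)
```

Because the contacted units are a subset of the eligible units, which are a subset of the sample, the rate relative to contacted units is the largest of the three and the rate relative to the whole sample the smallest.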


(e) Item Response/Nonresponse Rates 


Complex questionnaire design may result in item nonresponse of specific questions for reasons 
other than refusals, as noted in Box 17. A controversial or personal question or termination 
of the interview may result in a refusal to provide data for a specific item as in (Box 14). 

Thus, one may measure the overall item nonresponse rate for item y, relative to all
responding units, given by:

$$ \frac{\text{(Box 13)}}{\text{(Box 9)} + \text{(Box 10)}} $$


or, if item y is relevant only for some units (questionnaires) but not for all of them, one
may measure the item nonresponse relative to only those responding units for which item
y is relevant (eligible). Consequently, one may define a whole set of item
response/nonresponse/eligibility rates analogous to the unit rates, replacing the
number of units (eligible/ineligible, responding/refusing, etc.) in the rates with the number of
responding units (eligible or relevant for item y/irrelevant, responding for item y/refusing for
item y, etc.) respectively. Most of the rates pertaining to units, other than contact rates, should
have their item y counterparts readily defined by making the proper substitutions in the
expressions. However, it may be more difficult to record the reasons for item nonresponse
than for unit nonresponse, as frequently the item nonresponse is detected only through
an edit and imputation routine.


(f) Weighted Rates and Characteristic Rates 


In the case of samples with different sample weights $\Pi_i^{-1}$ for the units, as in probability
proportional to size (pps) sampling, all of the above rates may be defined as weighted rates by
applying the sample weight $\Pi_i^{-1}$ together with the sample selection indicator variable $t_i$ in
all the expressions. In the case of self-weighting samples in an area or class for which the rates
are calculated, the sample weights are redundant. In pps sampling at the final stage, however,
the usual tendency is for large units to respond more readily than small ones, so that weighted
response rates, with smaller sample weights applied to large units than to small units,
tend to be smaller than unweighted rates based on the counts of units as in Table 1.

The weighted response rates estimate the proportion of the population that would have 
responded to the survey under similar survey conditions while the unweighted response rates 
provide a measure of data collection performance only for the sample or sub-sample per- 
taining to a specified area or class. 
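
A small numerical sketch shows how weighting can lower the response rate when large, lightly weighted units respond more readily. The weights stand for inverse inclusion probabilities; the function name and all values are hypothetical.

```python
def weighted_response_rate(weights, responded):
    """Sum of weights over respondents divided by sum of weights over the sample."""
    responding_weight = sum(w for w, r in zip(weights, responded) if r)
    return responding_weight / sum(weights)


# Two large units (small weights) and two small units (large weights).
weights = [2.0, 2.0, 10.0, 10.0]
responded = [True, True, True, False]   # one small unit fails to respond

unweighted = sum(responded) / len(responded)
weighted = weighted_response_rate(weights, responded)
```

Here the single nonrespondent carries a large weight, so the weighted rate falls below the unweighted rate of 0.75.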

By estimating the nonresponse rate for the entire population rather than for the sample 
as the unweighted rates do, the weighted rate may provide misleading information on the 
quality of the data since it may distort the distribution of characteristics in the sample. The 
advantage of the weighted rates, however, is that the units are added to population levels 
rather than sample levels so that one obtains an estimate of the rate that would prevail at
census levels under similar conditions of gathering survey data. The weighted response rates 
may under some circumstances be used as weight adjustment factors to inflate the respondents 
to the full sample in adjustment cells. 

When defining characteristic response rates, the factors include the observed response $y_i$
among item respondents, the imputed value $z_{iy}$ for item nonresponse and the imputed value
$z_i$ for unit nonresponse, which is usually the mean of the respondents in an adjustment
cell. If some auxiliary value $x_i$ is known for all units, whether or not they respond, then
a characteristic x response rate may be readily calculated and used as a weight adjustment
when x is highly correlated with y. The characteristic y response rate, weighted by $\Pi_i^{-1}$ or
unweighted, may be useful in studying the potential nonresponse bias by comparing the
characteristic y response rates with the weighted or unweighted response rates based on counts of
units.


4. FINAL REMARKS 


Standardization of the definitions of the rates appears to be difficult, owing to the variety
of uses and studies of nonresponse and owing to the careful record keeping demanded of
survey takers. As long as the rates are unambiguously defined and appropriately applied
in their analysis, standard definitions for all types of surveys and survey data gathering
procedures may not be all that important. However, in each particular case, the rate should
be carefully defined with clear demonstration of the purpose for which it is intended and
the reason why it is adopted.

Another issue of standardization dealing with the topic of response/nonresponse rates
is the standard of what is expected from past experience for given surveys, type of survey,
subject matter and interview procedure. For example, the response rate, according to (3.2),
in the LFS, is expected to be in the 93 to 95% range, with slightly lower rates in the summer
months. Out of the 5 to 7% nonresponse, 1% or so may be expected to be refusals. The
overall rates have been remarkably consistent for the history of the survey.

It has been observed (see Platek 1977) that finance-oriented surveys tend to have lower 
response (higher nonresponse) rates than surveys dealing with other topics. The finance surveys 
appear to be around 25% nonresponse while most of the others centre around 10 to 15%. 
Also, telephone surveys appear to have a slightly higher nonresponse rate (by about 2 to 
3%) than personal surveys for similar subject matter. Thus, from experience, one can deter- 
mine a standard objective for surveys of a given subject and interview procedure. 

It has been observed in publications such as Wiseman and McDonald (1980) that there 
are many opinions of the way nonresponse should be defined and measured. Thus, it ap- 
pears that one must grapple with the alternative definitions and terms and obtain relation- 
ships between them under various survey conditions. We have attempted to focus on the 
problems of the various definitions, terms and standards of response rates but have not solved 
the problems. A proper study can really be undertaken only with a thorough evaluation of 
survey records, which is possible only when good records are kept. Often, particularly in 
the case of quota samples, in telephone and mail surveys, nonrespondents are set aside and 
other units are substituted for them and treated like the originally selected units. The result 
is a higher observed quality of survey than is the case in reality because of the hidden 
nonresponse bias. Consequently, the way of treating nonrespondents and the evaluation of 
nonresponse, completion, etc. must be planned in advance of the survey data gathering in 
order to deal with it properly rather than during or after the survey. 




REFERENCES 


CANNELL, CHARLES (1978). Discussion of response rates. Health Survey Research Methods Con- 
ference, DHEW Publication No. (PHS) 79-3207. 


HAUCK, MATTHEW (1974). Planning field operations. In Handbook of Marketing Research (Robert 
Ferber), New York: McGraw-Hill, 147-159. 


KALTON, GRAHAM (1981). Compensating for missing survey data. Survey Research Center,
Institute for Social Research, University of Michigan, Ann Arbor, MI.


KLECKA, W.R. and A.J. TUCHFARBER (1979). Random digit dialing: A comparison to personal 
surveys. Public Opinion Quarterly (Spring), 105-114. 


KVIZ, FREDERICK J. (1977). Toward a standard definition of response rate. Public Opinion Quarterly 
(Summer), 265-267. 


LINDSTROM, HAKAN (1983). Non-response errors in sample surveys. Urval, Nummer 16,
Skriftserie utgiven av Statistiska Centralbyrån, Statistics Sweden, Stockholm.


O’NEILL, MICHAEL J., GROVES, ROBERT M., and CANNELL, CHARLES F. (1979). Telephone 
interview introductions and refusal rates: Experiments in increasing respondent cooperation. Paper 
presented at the 1979 Meeting of the American Statistical Association, Washington, D.C. 


PLATEK, RICHARD (1977). Some factors affecting non-response. Paper presented at the Interna- 
tional Statistical Institute, New Delhi, December. 


PLATEK, RICHARD and GRAY, G.B. (1985). Some aspects of nonresponse adjustment. Survey 
Methodology, 11, 1-14. 


WISEMAN, FREDERICK, and PHILIP McDONALD (1978). The nonresponse problem in consumer
telephone survey. Report No. 78-116, Marketing Science Institute, Cambridge, Mass.


WISEMAN, FREDERICK, and PHILIP MCDONALD (1980). Toward the development of industry 
standards for response and nonresponse rates. Report no. 80-101, Marketing Science Institute, Cam- 
bridge, Mass. 




Survey Methodology, June 1986 29 
Vol. 12, No. 1, pp. 29-36 
Statistics Canada 


Some Optimality Results in the Presence 
of Nonresponse 


V.P. GODAMBE and M.E. THOMPSON¹


ABSTRACT 


Using the optimal estimating functions for survey sampling estimation (Godambe and Thompson 1986), 
we obtain some optimality results for nonresponse situations in survey sampling. 


KEYWORDS: Optimum estimating function; Nonresponse. 


1. INTRODUCTION AND BACKGROUND 


A typical survey sampling set-up consists of a survey population P of N labelled individuals
i; $P = \{i: i = 1, \ldots, N\}$. With each individual i is associated a real value $y_i$. The vector
$y = (y_1, \ldots, y_i, \ldots, y_N)$ is called the population vector. Any subset s of P is called a
sample. Let $S = \{s\}$. Any probability distribution p on S is called a sampling design. A sample
s is drawn using a sampling design p, and the values $y_i$, $i \in s$, are ascertained through a survey.
Thus the data here are $\chi_s$, where

$$ \chi_s = \{s, (i, y_i): i \in s\}. \qquad (1.1) $$

On the basis of the data $\chi_s$ one tries to estimate a survey population parameter $\theta_N$, that
is, a specified real function of the population vector y; $\theta_N = \theta_N(y)$.

In relation to the above estimation problem we assume a superpopulation model under
which $y_1, \ldots, y_N$ are independent and, for certain known covariate values $x_1, \ldots, x_N$,

$$ \mathcal{E}(y_i - \theta x_i) = 0, \quad i = 1, \ldots, N, \qquad (1.2) $$

$\mathcal{E}$ being the expectation with respect to the model. In the model (1.2), $\theta$ is the usual unknown
regression parameter, the expectation being taken holding $x_i$ fixed. The usual intercept term
of the regression model is not mentioned in (1.2), for this term can often be eliminated by
an appropriate stratification (Godambe 1982). Note the model (1.2) does not specify the
variance function.

Following Godambe and Thompson (1986), for some specified numbers $a_i$, $i = 1, \ldots, N$,
we define the survey population parameter $\theta_N$ as the solution of the equation

$$ g \equiv \sum_{i=1}^{N} (y_i - \theta x_i) a_i = 0. \qquad (1.3) $$


1 V.P. Godambe and M.E. Thompson, Department of Statistics and Actuarial Science, University of Waterloo, 
Ontario, Canada, N2L 3G1. 

That is,

$$ \theta_N = \sum_{i=1}^{N} y_i a_i \Big/ \sum_{i=1}^{N} x_i a_i. \qquad (1.4) $$

The parameter $\theta_N$ is related to the model (1.2) through the equation

$$ \mathcal{E} g = 0. \qquad (1.5) $$


Any real function h of the data $\chi_s$ in (1.1) and the parameter $\theta$ is called an unbiased
estimating function for both the parameters $\theta_N$ and $\theta$ if

$$ E(h - g) = 0 \text{ for all } y \text{ and } \theta, \qquad (1.6) $$

E being the expectation under the sampling design p employed to draw the sample s. Because
of (1.5) and (1.6) we say the solution of the equation

$$ h(\chi_s, \theta) = 0, $$

for the given data $\chi_s$, estimates both the parameters $\theta$ and $\theta_N$, given by (1.2) and (1.4)
respectively. For the function $g = \sum_{i=1}^{N} (y_i - \theta x_i) a_i$ of (1.3), under the sampling design p,
let H(p) be the class of all unbiased estimating functions h. That is,

$$ H(p) = \{h: E(h - g) = 0 \text{ for all } y \text{ and } \theta\}. \qquad (1.7) $$

Now we say an estimating function $h^* \in H(p)$ is optimum if

$$ \mathcal{E}E(h^*)^2 \le \mathcal{E}E(h)^2 \text{ for all } h \in H(p) \qquad (1.8) $$

(Godambe and Thompson 1986). Further, when the inequality (1.8) is satisfied,

$$ h^* = 0 \qquad (1.9) $$

is said to be the optimum estimating equation for estimating the parameter $\theta_N$ given by (1.3)
and (1.4).


For the sampling design p used to draw a sample s, let $\pi_i$, $i = 1, \ldots, N$, be the inclusion
probabilities. That is,

$$ \pi_i = \sum_{s \ni i} p(s), \quad i = 1, \ldots, N, \qquad (1.10) $$

where $s \ni i$ indicates all samples s which include the individual i. We assume

$$ \pi_i > 0, \quad i = 1, \ldots, N. \qquad (1.11) $$
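
For a small design the inclusion probabilities of (1.10) can be obtained by direct enumeration. The sketch below is illustrative: the function name, the representation of a design as a dictionary from samples to probabilities, and the 0-based unit labels are all assumptions made here.

```python
from itertools import combinations


def inclusion_probabilities(design, N):
    """Sum p(s) over all samples s containing unit i, for each i = 0, ..., N-1.

    design maps each sample (a frozenset of unit labels) to its probability p(s).
    """
    pi = [0.0] * N
    for sample, prob in design.items():
        for i in sample:
            pi[i] += prob
    return pi


# Illustration: simple random sampling of 2 units out of N = 4 without replacement.
N = 4
samples = [frozenset(c) for c in combinations(range(N), 2)]
design = {s: 1.0 / len(samples) for s in samples}
pi = inclusion_probabilities(design, N)
```

Each unit belongs to three of the six equally likely samples, so every inclusion probability is one half and the positivity assumption (1.11) holds.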

Theorem 1.1. (Godambe and Thompson 1986). For any sampling design p satisfying (1.11),
under the model (1.2), in the class of all unbiased estimating functions H(p) in (1.7), the
optimum $h^*$, that is $h^*$ satisfying (1.8), is given by

$$ h^* = \sum_{i \in s} (y_i - \theta x_i) a_i / \pi_i, \qquad (1.12) $$

$\pi_i$ being the inclusion probability given by (1.10). Thus the optimum estimating equation
here is

$$ \sum_{i \in s} (y_i - \theta x_i) a_i / \pi_i = 0. \qquad (1.13) $$

The estimate $\hat{\theta}_s$ of the survey population parameter $\theta_N$ in (1.4) and the superpopulation
parameter $\theta$ in (1.2) is given by

$$ \hat{\theta}_s = \frac{\sum_{i \in s} y_i a_i / \pi_i}{\sum_{i \in s} x_i a_i / \pi_i}. \qquad (1.14) $$

This estimate was previously put forward by Brewer (1963) and Hajek (1971) on some
‘‘plausibility’’ considerations.
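
The estimate obtained by solving the optimum estimating equation is a ratio of two inverse-probability-weighted sums over the sample. A minimal sketch, assuming the sampled values are held in aligned lists (the function name and the data are hypothetical):

```python
def theta_hat(y, x, a, pi):
    """Ratio of weighted sums over the sample: sum(y*a/pi) / sum(x*a/pi).

    y, x, a and pi hold the values for the sampled units only, in the same order.
    """
    numerator = sum(yi * ai / p for yi, ai, p in zip(y, a, pi))
    denominator = sum(xi * ai / p for xi, ai, p in zip(x, a, pi))
    return numerator / denominator


# Made-up sample of three units with unequal inclusion probabilities.
estimate = theta_hat(y=[2.0, 4.0, 9.0], x=[1.0, 2.0, 3.0],
                     a=[1.0, 1.0, 1.0], pi=[0.2, 0.4, 0.6])
```

With these values the weighted sums are 35 and 15, so the estimate is 7/3.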

To explain the relationships of Theorem 1.1 above with earlier optimality results (e.g.
Godambe 1982) we put $a_i = 1$ in (1.3) and therefore in (1.2). Further, we consider a
superpopulation model obtained from (1.2) by letting $\theta = \theta_0$, a specified value. Now for any
sampling design with inclusion probabilities $\pi_i$ satisfying (1.11), in the class of all design
unbiased estimates of $\theta_N$ (in (1.4) with $a_i = 1$, $i = 1, \ldots, N$), the superpopulation expectation
of the design variance is minimized for the estimate

$$ e = \frac{1}{X} \left[ \sum_{i \in s} \frac{y_i - \theta_0 x_i}{\pi_i} + \theta_0 \sum_{i=1}^{N} x_i \right] \qquad (1.15) $$

where $X = \sum_{1}^{N} x_i$. This ‘‘optimality’’ of the estimate e at $\theta = \theta_0$ carries over to all values
of $\theta$ if the sampling design is such that

$$ \text{Probability} \left\{ \sum_{i \in s} \frac{x_i}{\pi_i} = X \right\} = 1. \qquad (1.16) $$

Now when the sampling design satisfies condition (1.16), then $\hat{\theta}_s$ in (1.14) is equal to e in
(1.15). Thus all the earlier optimality results are covered by Theorem 1.1, and it does a great
deal more: in many situations, such as for designs with $\pi_i \propto x_i$, the condition (1.16) implies
a fixed sample size design. In contrast the ‘‘optimality’’ in Theorem 1.1 holds regardless
of the fixed sample size design condition. That is, the ‘‘optimality’’ is available for random
sample size designs, which are common in the nonresponse situations discussed subsequently.
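
The remark about designs with inclusion probabilities proportional to $x_i$ can be checked in one line. As a sketch, read condition (1.16) as requiring $\sum_{i \in s} x_i/\pi_i = X$ with probability one, suppose $\pi_i = c\,x_i$ for a constant $c > 0$, and write $n(s)$ for the number of units in s (both $c$ and $n(s)$ are notation introduced here for illustration):

```latex
\sum_{i \in s} \frac{x_i}{\pi_i} \;=\; \sum_{i \in s} \frac{1}{c} \;=\; \frac{n(s)}{c},
\qquad \text{so the condition holds only if } n(s) = cX \text{ for every } s \text{ with } p(s) > 0 .
```

Every sample with positive probability then has the same size, which is the fixed sample size property referred to above.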


2. NONRESPONSE AND OPTIMALITY


Suppose a sample s is drawn from the survey population P, using a sampling design p.
Suppose because of nonresponse the variate values $y_i$ are available only for the subset $s' \subset s$;
$s - s'$ are the non-respondents. Thus now the data, instead of $\chi_s$ in (1.1), are

$$ \chi_{s,s'} = \{s, s', (i, y_i): i \in s'\}. \qquad (2.1) $$

We may now consider two problems of estimation: 


(I) If there were no nonresponse, that is if all the data $\chi_s$ in (1.1) were available, we
would have estimated the survey population parameter $\theta_N$ in (1.4) by solving the
optimum estimating equation given by (1.12), namely $h^* = 0$. When the hypothetical
data $\chi_s$ are replaced by $\chi_{s,s'}$ in (2.1), one may try to estimate $h^*$ with some function
$h'(\chi_{s,s'})$. This is in line with a suggestion of Rubin (1976). Following (1.7) we define
the class of unbiased estimating functions $h'$ (for $h^*$, given the sample s) as

$$ H'(p, \cdot, s) = \{h': E(h' - h^* \mid s) = 0 \text{ for all } y \text{ and } \theta\}; \qquad (2.2) $$

the ‘.’ in $H'$ indicates that the class $H'$ would be specified only after the response mechanism
is specified. Again we define $h'^{*}$ as the optimum estimating function in $H'$ in (2.2) if
$h'^{*} \in H'$ and if, under the model (1.2), $\mathcal{E}E\{(h'^{*})^2 \mid s\} \le \mathcal{E}E\{(h')^2 \mid s\}$ for all $h' \in H'$.


(II) Alternatively we could try to estimate the survey population parameter $\theta_N$ directly,
that is without estimating $h^*$ as in (I) above, from the data $\chi_{s,s'}$. In line with (1.7)
we define the class of unbiased estimating functions $h''(\chi_{s,s'})$:

$$ H''(p, \cdot) = \{h'': E(h'' - g) = 0 \text{ for all } y \text{ and } \theta\}; \qquad (2.3) $$

as before the ‘.’ in $H''$ indicates that the class $H''$, for its specification, requires the
specification of the response mechanism. Again $h''^{*}$ is called the optimum estimating function in
$H''$ if $h''^{*} \in H''$ and if, under (1.2), $\mathcal{E}E(h''^{*})^2 \le \mathcal{E}E(h'')^2$ for all estimating functions
$h'' \in H''$.

In $H'(p, \cdot, s)$ and $H''(p, \cdot)$ of (2.2) and (2.3) we have left the response mechanism ‘.’
unspecified. Now we specify it.

RESPONSE MECHANISM: If the individual i of the survey population P were included
in the sample s drawn,

i would respond with known probability $q_i$
and would fail to respond with probability $1 - q_i$,                    (2.4)

$i = 1, \ldots, N$. We assume $q_i > 0$, $i = 1, \ldots, N$.

The response mechanism $q = (q_1, \ldots, q_N)$ in (2.4) completely characterizes the class
$H'(p, \cdot, s)$ in (2.2) as $H'(p, q, s)$ and $H''(p, \cdot)$ in (2.3) as $H''(p, q)$.

The case (I) above is implemented by the following Theorem 2.1 and the remaining 
Theorems 2.2, 2.3 and 2.4 implement the case (II). 
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Theorem 2.1. For any sampling design p satisfying (1.11), and for any sample s, in the 
class of estimating functions H’ (p, q, s) in (2.2) under the superpopulation model (1.2) 


cE{h')*|s} is minimized for h’ = h'* where 
h'* = 3 (Yi — 9X;)0;j/7;q;; (2.5) 
LS SF 


that is h'* is the optimum estimating function in H’ (2055) ene 


Proof. As was emphasized in Section 1, the optimality of $h^*$ in (1.12) obtains even for
random sample size designs and for any values of $\alpha_i$, $i = 1, \ldots, N$ in (1.3). Thus the proof
of Theorem 2.1 is accomplished by replacing, in Theorem 1.1, the population '$P$' by '$s$' and
$\alpha_i$ by $\alpha_i/\pi_i$, $i \in s$, and noting that now the inclusion probabilities are $q_i$, $i \in s$.


Theorem 2.2. Let $\bar{H}''$ be the subclass of $H''$ in (2.3) such that any estimating function
$h''(x_{s,s'})$ in $\bar{H}''$ depends on $(s, s')$ only through $s'$. Then for any sampling design $p$ satisfying
(1.11), in the class $\bar{H}''(p, q)$, under the superpopulation model (1.2), $\mathcal{E}E\{(h'')^2\}$ is
minimized for $h'' = h^{\prime\prime *}$ where


$$h^{\prime\prime *} = \sum_{i \in s'} (y_i - \theta x_i)\,\alpha_i / (\pi_i q_i); \tag{2.6}$$

that is $h^{\prime\prime *}$ is the optimum estimating function in $\bar{H}''(p, q)$. □


Proof. This follows directly from Theorem 1.1, by replacing in it $s$ by $s'$ and the inclusion
probabilities $\pi_i$ by $\pi_i q_i$, $i = 1, \ldots, N$.


Theorem 2.3. The estimating function $h^{\prime\prime *}$ in (2.6) is the optimum estimating function
in the entire class $H''(p, q)$ given by (2.3). That is, the result of Theorem 2.2 is valid
without the restriction to the subclass $\bar{H}''$ of $H''$. □


Proof. For any given response probabilities $q$ in (2.4) and the sampling design $p$, the statistic
$\{(i, y_i) : i \in s'\}$ is sufficient for the population vector $y$. More specifically, referring to (1.1)
and (2.1), we have the conditional probability $\mathrm{Prob}(x_{s,s'} \mid x_{s'}, y)$ independent of $y$. Hence
for any estimating function $h'' \in H''(p, q)$ in (2.3) we have the estimating function
$E(h'' \mid x_{s'}) = \bar{h}'' \in \bar{H}''$ and $\mathcal{E}E(\bar{h}'')^2 \le \mathcal{E}E(h'')^2$. This proves Theorem 2.3.

When $s = s'$, that is when there are no nonrespondents, do we still estimate $h^*$ by
$h^{\prime *} = h^{\prime\prime *}$? The obvious negative answer to this question is obtained, as shown by Godambe
(1986), by an appropriate conditioning. The same reservation tends to be felt for cases where
there are only a few nonrespondents, and again appropriate conditioning holds some promise
of a resolution. In summary the formal optimality of $h^{\prime *} = h^{\prime\prime *}$ suggests that it is useful,
and is likely to give good estimation when nonresponse is considerable and the relative values
of the $q_i$ are known. However, it can clearly be improved upon in situations when
nonresponse is rare; improved versions will have natural conditional interpretations.
Appropriate conditioning becomes even more important in the case of unknown response
probabilities, as will be seen next.
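
As a small illustration of how the estimating function with known response probabilities is used, the sketch below solves $\sum_{i \in s'} (y_i - \theta x_i)\alpha_i/(\pi_i q_i) = 0$ for $\theta$, which gives a weighted ratio of the respondents' $y$ and $x$ values. The function name and the two-unit data set are hypothetical and serve only to show the arithmetic.

```python
def solve_theta_known_q(y, x, alpha, pi, q):
    """Solve sum_{i in s'} (y_i - theta*x_i) * alpha_i / (pi_i * q_i) = 0 for theta.

    All arguments are equal-length sequences over the respondents s' only.
    """
    # Each respondent is weighted by alpha_i / (pi_i * q_i).
    weights = [a / (p * r) for a, p, r in zip(alpha, pi, q)]
    numerator = sum(w * yi for w, yi in zip(weights, y))
    denominator = sum(w * xi for w, xi in zip(weights, x))
    if denominator == 0:
        raise ValueError("weighted sum of x is zero; theta is not identified")
    return numerator / denominator


# Hypothetical respondents: same inclusion probability, different response probabilities.
theta_hat = solve_theta_known_q(
    y=[10.0, 20.0], x=[1.0, 1.0], alpha=[1.0, 1.0], pi=[0.5, 0.5], q=[1.0, 0.5]
)
```

The unit with the smaller response probability receives the larger weight, so the solution is pulled toward its $y$ value.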

Now we assume that the survey population $P$ is divided into $k$ strata $P_j$, of sizes $N_j$,
$j = 1, \ldots, k$. Further suppose that the response probabilities are constant within each stratum.
That is


$$q_i = q^{(j)} \text{ for all } i \in P_j, \quad j = 1, \ldots, k. \tag{2.7}$$

Unlike in (2.4), where the response probabilities were assumed to be known, now we assume
that in (2.7), the response probabilities $q^{(j)}$, $j = 1, \ldots, k$ are unknown. Let $p_0$ denote the
stratified sampling design, consisting of drawing from the stratum $P_j$ a simple random sample
(without replacement) of size $n_j$, $j = 1, \ldots, k$. Now as in (2.3) we define the class of
unbiased estimating functions $h_1(x_{s,s'})$


$$H_1(p_0) = \{h_1 : E(h_1 - h^*) = 0 \text{ for all } y, \theta \text{ and } q^{(j)},\ j = 1, \ldots, k\}, \tag{2.8}$$

where $q^{(j)}$ are as in (2.7). Let $s'_j = s' \cap P_j$ and $|s'_j| = n'_j$, that is the size of the sample of
respondents from the stratum $P_j$, $j = 1, \ldots, k$.


Theorem 2.4. For the sampling design $p_0$, in the class of estimating functions $H_1(p_0)$
in (2.8), under the superpopulation model (1.2), $\mathcal{E}E(h_1^2)$ is minimized for $h_1 = h_1^*$ where


$$h_1^* = \sum_{j=1}^{k} \sum_{i \in s'_j} (y_i - \theta x_i)\,\alpha_i \big/ (n'_j / N_j); \tag{2.9}$$


that is $h_1^*$ is the optimum estimating function in $H_1(p_0)$.


Proof. The sampling distribution of the data $x_{s,s'}$ in (2.1) depends, in addition to the
unknown population vector $y$, on the unknown (parameter) $q^{(j)}$, $j = 1, \ldots, k$. Now for
every fixed $y$, the statistic $n'_j$, $j = 1, \ldots, k$ is completely sufficient for the parameter $q^{(j)}$,
$j = 1, \ldots, k$. Hence for fixed $y$ and $\theta$, in (2.8),


$$E(h_1 - h^*) = 0 \text{ for all } q^{(j)},\ j = 1, \ldots, k \;\Rightarrow\; E(h_1 - h^* \mid n'_j,\ j = 1, \ldots, k) = 0 \text{ for all } n'_j \ne 0, \tag{2.10}$$


ignoring sets of '0' measure. Further, conditional on the number of respondents $n'_j$ from
the stratum $P_j$, the probability of $i \in s'_j$ is $(n_j/N_j)(n'_j/n_j) = (n'_j/N_j)$. Hence for any
estimating function $h_1 \in H_1$ in (2.8) we have from Theorem 2.3,


$$\mathcal{E}E(h_1^2 \mid n'_j,\ j = 1, \ldots, k) \ge \mathcal{E}E(h_1^{*2} \mid n'_j,\ j = 1, \ldots, k), \tag{2.11}$$


$h_1^*$ being given by (2.9). Theorem 2.4 is proved by taking the expectations of both sides of
(2.11) for the variations of $n'_j$, $j = 1, \ldots, k$.

The optimum estimating function $h_1^*$ in (2.9) has the following intuitive interpretation.
If in (2.7), the response probabilities $q^{(j)}$, $j = 1, \ldots, k$ were known, by Theorem 2.3, the
optimum estimating function, for the sampling design $p_0$, would be given by


$$h'' = \sum_{j=1}^{k} \sum_{i \in s'_j} (y_i - \theta x_i)\,\alpha_i \Big/ \Big(\frac{n_j}{N_j}\, q^{(j)}\Big).$$


Now when $q^{(j)}$ are unknown (which is the case in Theorem 2.4), we estimate them by
$(n'_j/n_j)$, $j = 1, \ldots, k$. Substituting these estimates for $q^{(j)}$ in $h''$ yields the estimating
function $h_1^*$ of (2.9).
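
The substitution just described amounts to weighting every respondent in stratum $j$ by $N_j/n'_j$, so the sampled count $n_j$ cancels. A minimal sketch, assuming each stratum is supplied as a dictionary with a hypothetical layout (`N` for the stratum size and lists `y`, `x`, `alpha` over that stratum's respondents), solves the resulting equation for $\theta$:

```python
def solve_theta_stratified(strata):
    """Solve sum_j (N_j / n'_j) * sum_{i in s'_j} (y_i - theta*x_i) * alpha_i = 0 for theta.

    `strata` is a list of dicts with keys "N", "y", "x", "alpha"; the lists
    cover the respondents of that stratum only (illustrative layout).
    """
    numerator = 0.0
    denominator = 0.0
    for stratum in strata:
        n_resp = len(stratum["y"])  # n'_j, the number of respondents
        if n_resp == 0:
            raise ValueError("a stratum without respondents cannot be weighted")
        weight = stratum["N"] / n_resp  # N_j / n'_j
        numerator += weight * sum(a * y for a, y in zip(stratum["alpha"], stratum["y"]))
        denominator += weight * sum(a * x for a, x in zip(stratum["alpha"], stratum["x"]))
    return numerator / denominator


# Hypothetical data with x_i = alpha_i = 1, so theta is a population mean.
example_strata = [
    {"N": 100, "y": [10.0, 12.0], "x": [1.0, 1.0], "alpha": [1.0, 1.0]},
    {"N": 300, "y": [20.0], "x": [1.0], "alpha": [1.0]},
]
theta_hat_stratified = solve_theta_stratified(example_strata)
```

With $x_i = \alpha_i = 1$ the solution reduces to the stratum-size-weighted average of the respondent means, here $(100 \times 11 + 300 \times 20)/400$.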

These estimates obtained by solving the equations $h^{\prime *} = 0$, $h^{\prime\prime *} = 0$ and $h_1^* = 0$ in (2.5),
(2.6) and (2.9) respectively have previously been proposed, on plausibility considerations,
by several authors. A good reference in this connection is Cassel et al. (1983). The assumption
(2.4) of "response probabilities" seems to have evolved gradually in the literature. An
interesting early reference in this connection is Hartley (1946).

3. OPTIMAL INCLUSION PROBABILITIES

It should be emphasized here that the "optimality" of the estimating function $h^{\prime\prime *}$ in (2.6)
was established under the superpopulation model (1.2), which does not specify the variance
function. However the specification of the variance function in the model (1.2) would be
required to obtain the "optimal" inclusion probabilities. We assume


$$\mathcal{E}(y_i - \theta x_i)^2 = \sigma^2 f(x_i), \quad i = 1, \ldots, N, \tag{3.1}$$


where $f$ is a known function of $x$, and $\sigma^2$ can be unknown. Now for the estimating function
$h^{\prime\prime *}$ in (2.6), under (3.1), we have


$$\mathcal{E}E(h^{\prime\prime *})^2 = \sum_{i=1}^{N} \mathcal{E}(y_i - \theta x_i)^2 \frac{\alpha_i^2}{\pi_i q_i} = \sigma^2 \sum_{i=1}^{N} \frac{f(x_i)\,\alpha_i^2}{\pi_i q_i}. \tag{3.2}$$

In (3.2), the response probabilities $q_i$, as said in (2.4), are given (fixed) numbers. However,
(a sampling design with) the optimal inclusion probabilities can be obtained by minimizing
$\mathcal{E}E(h^{\prime\prime *})^2$ in (3.2) under a restriction, either (A) or (B).


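$$\text{(A): } \sum_{i=1}^{N} \pi_i = \text{constant}, \qquad \text{(B): } \sum_{i=1}^{N} \pi_i q_i = \text{constant}. \tag{3.3}$$


In (A) we hold the average size of the sample $s$ fixed, for $E|s| = \sum_{i=1}^{N} \pi_i$. In (B) we hold
fixed the average size of the effective sample $s'$, for $E|s'| = \sum_{i=1}^{N} \pi_i q_i$. Now since the $q_i$ are
fixed numbers we have for minimizing $\mathcal{E}E(h^{\prime\prime *})^2$ in (3.2), respectively,


$$\text{(A): } \pi_i \propto \frac{(f(x_i))^{1/2}\alpha_i}{q_i^{1/2}}, \qquad \text{(B): } \pi_i \propto \frac{(f(x_i))^{1/2}\alpha_i}{q_i}. \tag{3.4}$$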
Denoting by $n'$ the size of the effective sample $s'$, that is $|s'| = n'$, we have from (B) in (3.4),


$$\pi_i = \frac{(f(x_i))^{1/2}\alpha_i / q_i}{\sum_{l=1}^{N} (f(x_l))^{1/2}\alpha_l}\; E(n'), \quad i = 1, \ldots, N. \tag{3.5}$$


Further for a fixed sample size design such that


Probability $\{s : |s| \ne n\} = 0$,


we have from (3.5),


$$n = \sum_{i=1}^{N} \pi_i = \frac{\sum_{i=1}^{N} (f(x_i))^{1/2}\alpha_i / q_i}{\sum_{i=1}^{N} (f(x_i))^{1/2}\alpha_i}\; E(n'). \tag{3.6}$$

As a special case, when all the response probabilities $q_i$, $i = 1, \ldots, N$ are equal, $q_i = q$ say,
we have from (3.6)


$$n = E(n')/q; \tag{3.7}$$


for instance if $q = 1/2$, the sample size of the (initial) sample $s$ should be double the expectation
of the effective sample ($s'$) size!
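
The following sketch illustrates the two allocations numerically. It assumes the forms obtained by minimizing $\sum_i f(x_i)\alpha_i^2/(\pi_i q_i)$ with a Lagrange multiplier: $\pi_i \propto (f(x_i))^{1/2}\alpha_i/q_i^{1/2}$ when $\sum_i \pi_i$ is held fixed (restriction A) and $\pi_i \propto (f(x_i))^{1/2}\alpha_i/q_i$ when $\sum_i \pi_i q_i$ is held fixed (restriction B). The function name and the inputs are hypothetical.

```python
import math


def optimal_inclusion_probs(f_x, alpha, q, target, restriction="B"):
    """Return inclusion probabilities pi_i for restriction "A" or "B".

    Under "A", sum(pi_i) equals `target` (expected size of s).
    Under "B", sum(pi_i * q_i) equals `target` (expected size of s').
    """
    if restriction == "A":
        raw = [math.sqrt(f) * a / math.sqrt(r) for f, a, r in zip(f_x, alpha, q)]
        total = sum(raw)
    elif restriction == "B":
        raw = [math.sqrt(f) * a / r for f, a, r in zip(f_x, alpha, q)]
        total = sum(v * r for v, r in zip(raw, q))
    else:
        raise ValueError("restriction must be 'A' or 'B'")
    probs = [target * v / total for v in raw]
    if any(p > 1.0 for p in probs):
        raise ValueError("target is too large: some pi_i would exceed 1")
    return probs


# Equal response probabilities q = 1/2 and an expected effective sample size of 1.
pi_b = optimal_inclusion_probs([1.0] * 4, [1.0] * 4, [0.5] * 4, target=1.0, restriction="B")
initial_sample_size = sum(pi_b)  # equals E(n') / q when all q_i are equal
```

In this example the expected initial sample size comes out as twice the expected effective sample size, in line with the $q = 1/2$ remark above.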

Now we assume the survey population $P$ to be divided into strata $P_j$, $j = 1, \ldots, k$ so that
the response probabilities in each stratum are constant, that is they satisfy (2.7). For a stratified
sampling design consisting of drawing a sample of size $n_j$ from the stratum $P_j$, $j = 1, \ldots, k$
we have from (3.5),


$$n_j = \sum_{i \in P_j} \pi_i = \frac{\sum_{i \in P_j} (f(x_i))^{1/2}\alpha_i}{q^{(j)} \sum_{i=1}^{N} (f(x_i))^{1/2}\alpha_i}\; E(n'), \quad j = 1, \ldots, k. \tag{3.8}$$


If $(f(x_i))^{1/2}\alpha_i$ are constant for $i = 1, \ldots, N$, it is clear from (3.8) that optimal allocation
implies drawing a relatively larger sample from the stratum with smaller response probability.
Actually in this situation


$$E(n'_j) = E(n')/k$$


where $n'_j$ is the size of the effective sample $s'_j$ from the stratum $P_j$, $j = 1, \ldots, k$.
Basic Ideas of Multiple Imputation for Nonresponse


DONALD B. RUBIN¹


ABSTRACT 


Multiple imputation is a technique for handling survey nonresponse that replaces each missing value 
created by nonresponse by a vector of possible values that reflect uncertainty about which values to 
impute. A simple example and brief overview of the underlying theory are used to introduce the general 
procedure. 


KEY WORDS: Survey nonresponse; Proper imputation methods; Multiple imputation. 


1. INTRODUCTION 


Any statistician with experience in the field of surveys knows that essentially every survey 
suffers from some nonresponse. That is, in practical surveys, some items in the survey in- 
strument are not answered by all units included in the survey. Commonly, the items likely 
to be unanswered are the more sensitive ones, such as those concerning personal income. 
Because nonresponse creates missing values, the complete-data statistics that would have been 
used in the absence of nonresponse can no longer be calculated. An obvious desire of both 
the data collector and the data analyst is to get rid of the missing values and thereby restore 
the ability to use standard complete-data methods to draw inferences. 


1.1. Imputation 


It is not surprising, therefore, that a very common method of handling the missing values 
created by nonresponse is to fill them in, or impute them. That is, when using imputation 
to handle nonresponse each missing value is replaced with a real value. Many different pro- 
cedures have been proposed for imputation, for instance, filling in the respondents’ mean 
for that variable or a value predicted from the modelling of the missing variable given observed 
variables using respondent data; as a specific example, when the missing value is personal 
income, a linear regression model predicting log(income) from demographic characteristics 
such as age, sex, education and occupation might be regarded as reasonable. 


1.2 Advantages and Disadvantages of Single Imputation 


In addition to the obvious advantage of allowing complete-data methods of analysis, im- 
putation by the data collector (e.g. the Census Bureau) also has the important advantage 
of being able to utilize information available to the data collector but not available to an 
external data analyst such as a university social scientist analyzing a public-use file. This in- 
formation may involve detailed knowledge of interviewing procedures and reasons for 
nonresponse that are too cumbersome to place in public-use files, or may be facts, such as 
street addresses of dwelling units, that cannot be placed on public-use files because of con- 
fidentiality constraints. This kind of information, even though inaccessible to the user of 
a public-use file, can often narrow the possible range of imputed values. 
Just as there are obvious advantages to imputing one value for each missing value, there
are obvious disadvantages of this procedure arising from the fact that the one imputed value
cannot itself represent any uncertainty about which value to impute: If one value were really
adequate, then that value was never missing. Hence, analyses that treat imputed values just
like observed values generally systematically underestimate uncertainty, even assuming the
precise reasons for nonresponse are known. Equally serious, single imputation cannot represent
any additional uncertainty that arises when the reasons for nonresponse are not known.


1.3 Multiple Imputation to the Rescue 


Multiple imputation, first proposed in Rubin (1977, 1978), retains the two major advantages
of single imputation and rectifies its major disadvantages. As its name suggests, multiple
imputation replaces each missing value by a vector composed of $M \ge 2$ possible values.
The M values are ordered in the sense that the first components of the vectors for the miss- 
ing values are used to create one completed data set, the second components of the vectors 
are used to create the second completed data set and so on. The first major advantage of 
single imputation is retained with multiple imputation, since standard complete-data methods 
are used to analyze each completed data set. The second major advantage of imputation, 
that is, the ability to utilize data collectors’ knowledge in handling the missing values, is not 
only retained but actually enhanced. In addition to allowing data collectors to use their 
knowledge to make point estimates for imputed values, multiple imputations allow data col- 
lectors to reflect their uncertainty as to which values to impute. This uncertainty is of two 
types: sampling variability assuming the reasons for nonresponse are known, and variability 
due to uncertainty about the reasons for nonresponse. Under each posited model for 
nonresponse, two or more imputations are created to reflect sampling variability under that 
model; imputations under more than one model for nonresponse reflect uncertainty about 
the reasons for nonresponse. The multiple imputations within one model are called repeti- 
tions and can be combined to form a valid inference under that model; the inferences under 
different models can be contrasted to reveal sensitivity of answers to posited reasons for 
nonresponse. 

Before reviewing some more general results in Section 3, Section 2 illustrates essential ideas 
in a highly artificial example used in Rubin (1986a), which is a comprehensive treatment of 
multiple imputation. Other references on multiple imputation include Rubin (1979, 1980, 
1986b), Herzog and Rubin (1983), Li (1985), Schenker (1985), Rubin and Schenker (1986), 
and Heitjan and Rubin (1986). 


2. AN ARTIFICIAL EXAMPLE ILLUSTRATING MULTIPLE IMPUTATION 


Suppose we have taken a simple random sample of $n = 10$ units from a large population.
The objective of the survey is to estimate $\bar{Y}$, the mean of $Y$ in the population. We know the
mean value of a covariate $X$ in the population, and the survey attempts to record both $X$
and $Y$ for each of the $n$ units included in the sample.

Table 1 presents the observed values of $(Y, X)$ for the ten units in the sample where the
question marks indicate missing $Y$ data due to nonresponse.


2.1 Multiply Imputing for the Missing Values 


Suppose the missing values in Table 1 are to be multiply imputed using two values drawn
under each of two models (i.e. two repetitions per model). In general, any number of models
can be used with any number of repetitions within each model. Model 1 is an "ignorable"
model for nonresponse; ignorable is defined precisely in Rubin (1976), but essentially it means
that a nonrespondent is only randomly different from a respondent with the same value of
$X$. Model 2 is a nonignorable model and posits a systematic difference between respondents
and nonrespondents with the same value of $X$. The repeated imputations under each model
are based on a simple procedure closely related to the hot-deck, which can be improved upon
but is useful to illustrate ideas.

For each nonrespondent, the two closest matches among the respondents are found, where 
the distance for matching is defined by the values of X. For the first nonrespondent, unit 
2, the two closest matches are units 1 and 3, and for the second nonrespondent, unit 4, the 
closest matches are 3 and 5. The repeated imputations are created by drawing at random 
from the two closest matches. For the ignorable model, we simply impute the value Y pro- 
vided by the matching respondent: the first two columns of Table 2 give the result. For the 
nonignorable model, we suppose that the nonresponse bias is such that a nonrespondent will 
tend to have a value of Y 20% higher than the matching respondent’s value of Y: the last 
two columns of Table 2 give the result where the Y values have been rounded to the nearest 
integer. The repeated imputations within each model allow the user to draw a valid inference 
under that model. The use of two models, an ignorable one and a nonignorable one, allows 
the display of sensitivity of inference to assumptions about nonresponse. Generally such 
assumptions are untestable using the data at hand. 


Table 1
Observed Data


Unit     Y     X
   1    10     8
   2     ?     9
   3    14    11
   4     ?    13
   5    16    16
   6    15    18
   7    20     6
   8     4     4
   9    18    20
  10    22    25

Table 2
Multiple Imputations for Data of Table 1

            Model 1         Model 2
          Repetition      Repetition
            1     2         1     2
Unit 2     10    14        12    17
Unit 4     16    14        19    17
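
The matching procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the data are those of Table 1 (with the $X$ value of unit 7 read from the completed data sets), the function names are hypothetical, ties in distance are broken by unit number, and the nonignorable model is represented by multiplying the donor's $Y$ by 1.2 and rounding.

```python
import random

# Table 1: unit -> (Y, X); None marks a Y value missing through nonresponse.
TABLE_1 = {
    1: (10, 8), 2: (None, 9), 3: (14, 11), 4: (None, 13), 5: (16, 16),
    6: (15, 18), 7: (20, 6), 8: (4, 4), 9: (18, 20), 10: (22, 25),
}


def closest_donors(data, unit, n_donors=2):
    """Return the respondents whose X is closest to the X of `unit`."""
    x_target = data[unit][1]
    respondents = [u for u, (y, _) in data.items() if y is not None]
    respondents.sort(key=lambda u: (abs(data[u][1] - x_target), u))
    return respondents[:n_donors]


def impute_once(data, rng, inflation=1.0):
    """Draw one imputation per nonrespondent from its two closest matches.

    inflation=1.0 corresponds to the ignorable model; inflation=1.2 to the
    nonignorable model in which nonrespondents tend to be 20% higher.
    """
    imputed = {}
    for unit, (y, _) in data.items():
        if y is None:
            donor = rng.choice(closest_donors(data, unit))
            imputed[unit] = round(inflation * data[donor][0])
    return imputed


rng = random.Random(1986)  # arbitrary seed, for reproducibility only
model_1_draws = [impute_once(TABLE_1, rng) for _ in range(2)]
model_2_draws = [impute_once(TABLE_1, rng, inflation=1.2) for _ in range(2)]
```

Because the donors are drawn at random, a particular run need not reproduce the columns of Table 2; it only draws from the same sets of candidate values.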
2.2 Analyzing the Resultant Multiply-Imputed Data Set


Each set of imputations, that is each column of Table 2, can be used with the incomplete 
data in Table 1 to create a completed data set. Since there are four sets of imputations, four 
completed data sets can be created; these are displayed in Tables 3 to 6. Each completed 
data set is analyzed just as if there had been no nonresponse. 

Assume that with complete data, the ratio estimator $\bar{X}\bar{y}/\bar{x}$ would be used with associated
variance $SE^2$, where $\bar{X}$ is the known mean of $X$ in the population, say 12, $\bar{y}$ and $\bar{x}$ are the
means of $Y$ and $X$ in the random sample of $n$ units, and


$$SE^2 = \sum (Y_i - X_i \bar{y}/\bar{x})^2 / [n(n - 1)]$$


Table 3

Complete Data Set 1 (Model 1, Rep. 1)
For Multiply Imputed Data Set of Tables 1 and 2

Unit      Y      X
   1     10      8
   2     10      9
   3     14     11
   4     16     13
   5     16     16
   6     15     18
   7     20      6
   8      4      4
   9     18     20
  10     22     25
means  14.5     13

Table 4

Complete Data Set 2 (Model 1, Rep. 2)
For Multiply Imputed Data Set of Tables 1 and 2

Unit      Y      X
   1     10      8
   2     14      9
   3     14     11
   4     14     13
   5     16     16
   6     15     18
   7     20      6
   8      4      4
   9     18     20
  10     22     25
means  14.7     13

Table 5

Complete Data Set 3 (Model 2, Rep. 1)
For Multiply Imputed Data Set of Tables 1 and 2

Unit      Y      X
   1     10      8
   2     12      9
   3     14     11
   4     19     13
   5     16     16
   6     15     18
   7     20      6
   8      4      4
   9     18     20
  10     22     25
means  15.0     13

Table 6

Complete Data Set 4 (Model 2, Rep. 2)
For Multiply Imputed Data Set of Tables 1 and 2

Unit      Y      X
   1     10      8
   2     17      9
   3     14     11
   4     17     13
   5     16     16
   6     15     18
   7     20      6
   8      4      4
   9     18     20
  10     22     25
means  15.3     13


Table 7

Ratio Estimates and Associated Variances of Estimates
for the Complete Data Sets of Tables 3-6

               Model 1           Model 2
             Repetition        Repetition
               1      2          1      2
Estimate   13.38  13.57      13.85  14.12
Variance    2.96   3.19       3.38   3.84



Table 8

Combined Estimates and Variances for the Multiply
Imputed Data Sets of Tables 1 and 2

            Model 1    Model 2
Estimate      13.48      13.98
Variance       3.10       3.66


where the sum is over the units in the sample. Table 7 presents the estimates and variances 
associated with each of the four completed data sets given in Tables 3-6. 
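
The complete-data analysis applied to each completed data set can be sketched as below. The data are those of Tables 1 and 2 with the known population mean of $X$ taken as 12; the names `complete` and `ratio_estimate` are illustrative only. For the two Model 1 data sets the computation agrees, after rounding, with the values 13.38 and 13.57 (estimates) and 2.96 and 3.19 (variances) quoted in the text.

```python
X_POP_MEAN = 12.0
X = [8, 9, 11, 13, 16, 18, 6, 4, 20, 25]
Y_OBS = [10, None, 14, None, 16, 15, 20, 4, 18, 22]  # None = nonresponse

# Imputed values for units 2 and 4, one pair per column of Table 2.
IMPUTATIONS = {
    ("Model 1", 1): (10, 16),
    ("Model 1", 2): (14, 14),
    ("Model 2", 1): (12, 19),
    ("Model 2", 2): (17, 17),
}


def complete(y_obs, fill):
    """Replace the missing entries of y_obs, in order, by the values in fill."""
    values = iter(fill)
    return [next(values) if y is None else y for y in y_obs]


def ratio_estimate(y, x, x_pop_mean):
    """Return the ratio estimate and SE^2 = sum (Y_i - X_i*ybar/xbar)^2 / [n(n-1)]."""
    n = len(y)
    ratio = (sum(y) / n) / (sum(x) / n)  # ybar / xbar
    estimate = x_pop_mean * ratio
    se2 = sum((yi - xi * ratio) ** 2 for yi, xi in zip(y, x)) / (n * (n - 1))
    return estimate, se2


results = {
    key: ratio_estimate(complete(Y_OBS, fill), X, X_POP_MEAN)
    for key, fill in IMPUTATIONS.items()
}
```

Nothing in `ratio_estimate` knows that some values were imputed, which is the point of the approach: the same routine would be used with fully observed data.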

The two answers obtained under the same model can be combined to obtain one inference
for $\bar{Y}$ under each model. The results are displayed in Table 8: the estimate is the average
of the estimates and the variance associated with this estimate has two components: (i) the
average within-imputation variance associated with the estimate and (ii) the between-imputation
variance of the estimate. Thus, under Model 1, the estimate is
$(13.38 + 13.57)/2 = 13.48$; the associated estimated average within variance is
$(2.96 + 3.19)/2$, and the associated estimated between variance is
$[(13.38 - 13.48)^2 + (13.57 - 13.48)^2]$. The estimated variances are combined as: (estimated total variance) = (estimated
average within variance) + $(1 + M^{-1})$ × (estimated between variance), where the factor
$(1 + M^{-1})$ multiplying the usual unbiased estimate of between variance is an adjustment
for using a finite number of imputations. The associated 95% interval estimate for $\bar{Y}$ is (10.0,
16.9) under Model 1 and (10.2, 17.7) under Model 2. In practice, better intervals can be formed
by calculating degrees of freedom as a simple function of the variance components and using
the 95% points appropriate to the corresponding $t$-distribution; when either $M$ is large
or the between variance component is small relative to the total variance (as in this artificial
example), the degrees of freedom will be large and thus the normal 95% points will be used.
Details are given in Section 3.

The essential feature to notice in this illustrative example is that only complete-data methods 
of analysis are needed. We merely have to perform the complete-data analysis that would 
have been used in the absence of nonresponse on each of the completed data sets created 
by the multiple imputations. The resultant answers under each model are then easily com- 
bined to give one inference under each model. Although not illustrated here, diagnostic 
analyses using complete-data techniques can be applied to each completed data set; Heitjan 
and Rubin (1986) provides several examples. 


3. GENERAL PROCEDURES 


The example in Section 2 illustrated methods for creating multiple imputations and analyz- 
ing the resultant multiply-imputed data set in a special case. We now outline the methods 
needed for general practice. 


3.1 Proper Imputation Methods 


Multiple imputations ideally should be drawn according to the following general scheme.
For each model being considered, the $M$ imputations of the missing values, $Y_{mis}$, are $M$
repetitions from the posterior predictive distribution of $Y_{mis}$, each repetition being an
independent drawing of the parameters and missing values under an appropriate Bayesian
model for the posited response mechanism. In practice, implicit models such as illustrated
in Section 2 can often be used in place of explicit models. Both types of models are illustrated
in Herzog and Rubin (1983), where repeated imputations are created using an explicit regression
model and an implicit matching model, which is a modification of the Census Bureau's
hot-deck.

Procedures that incorporate appropriate variability among the repetitions within a model
are called proper, which is defined precisely in Rubin (1986a). The essential idea of proper
imputation methods is to properly reflect sampling variability when creating repeated imputations
under a model. For example, assume ignorable nonresponse so that respondents
and nonrespondents with a common value of $X$ have $Y$ values only randomly different from
each other. Even then, simply randomly drawing imputations for nonrespondents from matching
respondents' $Y$ values ignores some sampling variability. This variability arises from
the fact that the sampled respondents' $Y$ values at $X$ randomly differ from the population
of $Y$ values at $X$. Properly reflecting this variability leads to repeated imputation inferences
that are valid under the posited response mechanism.

In the context of simple random samples and ignorable nonresponse, Rubin and Schenker 
(1986) study hot-deck imputation (i.e. simply randomly drawing imputed values from 
respondents), which is not proper, and a variety of proper imputation methods based on 
both explicit and implicit models, including a fully normal model, the Bayesian Bootstrap 
(Rubin, 1981), and an approximate Bayesian Bootstrap. The Approximate Bayesian Bootstrap 
(ABB) can be used to illustrate how an intuitive imputation method, such as the simple ran- 
dom hot-deck, can be modified to be proper. 


3.2 Example of a Proper Imputation Method with Ignorable Nonresponse - The ABB 


Consider a simple random sample of size $n$ with $n_R$ respondents and $n_{NR} = n - n_R$
nonrespondents. The ABB creates $M$ ignorable repeated imputations as follows. For
$\ell = 1, \ldots, M$, create $n$ possible values of $Y$ by first drawing $n$ values at random with replacement
from the $n_R$ observed values of $Y$, and second drawing the $n_{NR}$ missing values of $Y$
at random with replacement from those $n$ values. The drawing of the $n_{NR}$ missing values
from a possible sample of $n$ values rather than the observed sample of $n_R$ values generates
appropriate between imputation variability, at least in large samples, as shown by Rubin
and Schenker (1986). The ABB approximates the Bayesian Bootstrap by using a scaled
multinomial distribution to approximate a Dirichlet distribution.
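
A minimal sketch of the two-step drawing just described is given below. It follows the wording of this section (a first draw of $n$ possible values from the observed values, then a second draw of the missing values from those); the function name, the seed and the example data are assumptions made for illustration.

```python
import random


def abb_impute(y_observed, n_missing, m, rng):
    """Create m repeated imputations by the approximate Bayesian bootstrap.

    Step 1: draw n = n_R + n_NR possible values with replacement from the
            observed values.
    Step 2: draw the n_NR missing values with replacement from those n values.
    """
    n = len(y_observed) + n_missing
    repetitions = []
    for _ in range(m):
        possible = [rng.choice(y_observed) for _ in range(n)]
        repetitions.append([rng.choice(possible) for _ in range(n_missing)])
    return repetitions


# Observed Y values of Table 1 (eight respondents, two nonrespondents), M = 2.
observed = [10, 14, 16, 15, 20, 4, 18, 22]
abb_draws = abb_impute(observed, n_missing=2, m=2, rng=random.Random(0))
```

Dropping step 1 and drawing directly from `observed` would give the simple random hot-deck, which the text notes is not proper.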


3.3 Analysis - The Repeated Imputation Inference 


The general methods for analyzing a multiply imputed data set implicitly assume proper
imputation methods have been used to create the multiple imputations. As illustrated in Section 2,
the repeated imputations within each model are analyzed as a collection to create
one repeated-imputation inference as follows. Each data set completed by imputation is analyzed
using the same complete-data method that would be used in the absence of nonresponse.
More precisely, let $\hat\theta_\ell$, $U_\ell$, $\ell = 1, \ldots, M$ be $M$ complete-data estimates and their associated
variances for a parameter $\theta$, calculated from the $M$ data sets completed by repeated imputations
under one model for nonresponse. The final estimate of $\theta$ is


$$\bar\theta_M = \sum_{\ell=1}^{M} \hat\theta_\ell / M.$$

The variability associated with this estimate has two components: the average within-imputation
variance,


$$\bar{U}_M = \sum_{\ell=1}^{M} U_\ell / M,$$

and the between-imputation component,

$$B_M = \sum_{\ell=1}^{M} (\hat\theta_\ell - \bar\theta_M)^2 / (M - 1),$$


where with vector $\theta$, $(\cdot)^2$ is replaced by $(\cdot)^t(\cdot)$. The total variability associated with $\bar\theta_M$ is
then


$$T_M = \bar{U}_M + (1 + M^{-1}) B_M.$$


With scalar $\theta$, the reference distribution for interval estimates and significance tests is a
$t$-distribution,


$$(\theta - \bar\theta_M)\, T_M^{-1/2} \sim t_\nu,$$

where the degrees of freedom,

$$\nu = (M - 1)\left[1 + (1 + M^{-1})^{-1}\, \bar{U}_M / B_M\right]^2,$$


is based on a Satterthwaite approximation (Rubin and Schenker 1986 and Rubin 1986a).
The within to between ratio $\bar{U}_M / B_M$ estimates the population quantity $(1 - \gamma)/\gamma$, where
$\gamma$ is the fraction of information about $\theta$ missing due to nonresponse. In the case of ignorable
nonresponse with no covariates, $\gamma$ equals the fraction of data values that are missing.
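
For a scalar parameter the combining rules can be collected in one short function. The sketch below is illustrative: the function name and the returned keys are hypothetical, and the degrees of freedom are computed as $\nu = (M-1)[1 + \bar{U}_M/((1 + M^{-1})B_M)]^2$, the Satterthwaite-type form of Rubin and Schenker (1986), which is assumed here to be the expression intended in this section.

```python
import math


def pool_scalar(estimates, variances):
    """Combine M complete-data estimates and variances for a scalar parameter."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need M >= 2 estimates with matching variances")
    theta_bar = sum(estimates) / m                      # final estimate
    u_bar = sum(variances) / m                          # average within variance
    b = sum((e - theta_bar) ** 2 for e in estimates) / (m - 1)  # between variance
    total = u_bar + (1 + 1 / m) * b                     # total variance T_M
    if b == 0:
        nu = math.inf  # no between-imputation variability: normal reference
    else:
        nu = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return {"estimate": theta_bar, "within": u_bar, "between": b,
            "total": total, "df": nu}


# Model 1 values of the artificial example, as quoted in Section 2.
pooled_model_1 = pool_scalar([13.38, 13.57], [2.96, 3.19])
```

For the Model 1 values the between component is tiny relative to the within component, so the degrees of freedom are very large, consistent with the use of normal 95% points in the example.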


3.4 Significance Levels for Multicomponent $\theta$


For $\theta$ with $k$ components, significance levels for null values of $\theta$ can be obtained from
$M$ repeated complete-data estimates, $\hat\theta_\ell$, and variance-covariance matrices, $U_\ell$, using
multivariate analogues of the previous expressions.

A simple procedure described in Li (1985) and Rubin (1986a) that works well for $M$ large
relative to $k$ is to let the p-value for the null value $\theta_0$ of $\theta$ be $\mathrm{Prob}\{F_{k,\nu} > D_M\}$ where
$F_{k,\nu}$ is an $F$ random variable and $D_M = (\theta_0 - \bar\theta_M)\, T_M^{-1}\, (\theta_0 - \bar\theta_M)^t$ with $\nu$ defined by
generalizing $B_M/\bar{U}_M$ to be the average diagonal element of $B_M \bar{U}_M^{-1}$, that is $\mathrm{trace}(B_M \bar{U}_M^{-1})/k$.
Better procedures are described in Rubin (1986a). Less precise p-values can be obtained directly
from $M$ repeated complete-data significance levels; also see Rubin (1986a).


4. DISCUSSION 


4.1 Frequency Evaluations 


Although repeated imputation inferences are most directly motivated from the Bayesian
perspective, they can be shown to possess good frequency properties. In fact, the definition
of proper imputation methods means that in large samples infinite-$M$ repeated imputation
inferences will be valid. Since the finite-$M$ adjustments are derived using approximations
to Bayesian posterior distributions, however, deficiencies can arise with finite $M$. For example,
the large sample relative efficiency of $\bar\theta_M$ to $\bar\theta_\infty$, that is, the efficiency of the finite-$M$
repeated imputation estimator using proper imputation methods relative to the infinite-$M$
estimator in units of standard errors, is $(1 + \gamma/M)^{-1/2}$. Even for relatively large $\gamma$, modest
values of $M$ result in estimates $\bar\theta_M$ that are nearly fully efficient.
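
To see how quickly the efficiency $(1 + \gamma/M)^{-1/2}$ approaches one, it can simply be tabulated. The function name and the grid of $\gamma$ and $M$ values below are chosen for illustration only.

```python
def relative_efficiency(gamma, m):
    """Large-sample efficiency, in standard-error units, of M imputations
    relative to infinitely many: (1 + gamma / M) ** -0.5."""
    if m < 1 or not 0.0 <= gamma <= 1.0:
        raise ValueError("need M >= 1 and 0 <= gamma <= 1")
    return (1.0 + gamma / m) ** -0.5


# Rows: fraction of missing information; columns: number of imputations.
efficiency_table = {
    gamma: {m: relative_efficiency(gamma, m) for m in (1, 2, 3, 5, 10)}
    for gamma in (0.1, 0.3, 0.5, 0.9)
}
```

With half of the information missing ($\gamma = 0.5$), two imputations already give an efficiency of about 0.89 and five imputations about 0.95.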
4.2 Confidence Coverage


In large samples the confidence coverage of proper imputation methods using the $t$-reference
distribution can be tabulated as a function of $M$, $\gamma$ and the nominal level, $1 - \alpha$. Table
9 is from Rubin (1986a) and is also partially reported in Rubin and Schenker (1986) and
Schenker (1985). Also included are results for single imputation, where the between component
of variance is set to zero, since it cannot be estimated, and the reference distribution
is the normal, since $\nu$ cannot be estimated without $B_M$. Even in extreme cases, two or three
repeated imputations yield nearly valid confidence coverages; this is in striking contrast to
using only one imputation. Even worse coverages for single imputation would have been
obtained using best prediction methods, such as "fill in the mean".


Table 9


Coverage probabilities in % of interval estimates based on the $t$-reference distribution as a function
of the number of proper repeated imputations, $M \ge 2$, the fraction of missing information, $\gamma$, and
the nominal level, $1 - \alpha$. Also included for contrast are results based on single imputation, $M = 1$,
using the normal reference distribution with the between component of variability set to zero.
4.3 Significance Levels


Work on accurately obtaining significance levels is at an early stage of development. Table
10 is from Rubin (1986a) and is also partially reported in Li (1985). It indicates that if $M > k$
and $\gamma$ is modest, accurate tests can be obtained using $D_M$. Better procedures are considered
by Li (1985), Rubin (1986a) and in current thesis work by T.E. Raghunathan.


Table 10


Level in % of $D_M$ with $F_{k,\nu}$ reference distribution as a function of: nominal level, $\alpha$; number of
components being tested, $k$; number of repeated proper imputations, $M$; and fraction of missing
information, $\gamma$.

5. CONCLUSION


In conclusion, multiple imputation is a very promising new tool for helping to handle 
nonresponse in surveys. Although much work remains to be done before it will become a 
commonplace method, many interesting theoretical and practical results suggest effort ex- 
pended in its development will be well rewarded by important contributions to applied work. 
Imputation Options in a Generalized Edit
and Imputation System


P. GILES and C. PATRICK¹


ABSTRACT 


Statistics Canada has undertaken a project to develop a generalized edit and imputation system, the 
intent of which is to meet the processing requirements of most of its surveys. The various approaches 
to imputation for item non-response, which have been proposed, will be discussed. Important issues 
related to the implementation of these proposals into a generalized setting will also be addressed. 


KEY WORDS: Modularity; Prototyping; Donor imputation; Regression models. 


1. GENERALIZED SYSTEMS 


Due to resource constraints imposed on surveys in recent years, especially in the area of 
development, the idea of generalized software has received considerable support. By generalized 
software, it is meant a set of computer programs, tied together into one system, which allows 
the user to select a suitable approach to the problem, from among several alternatives. For 
example, a user has a data file from which a sample of records is to be selected. A generalized 
sample selection system would offer the user the choice of various sampling schemes such 
as simple random or unequal probability sampling (with or without replacement), systematic, 
stratified, or cluster sampling. 

A genuinely generalized system is, almost by definition, a complex object. The concept 
of modularity is an important device for the reduction of complexity, by allowing the overall 
task to be split into a number of simpler sub-tasks. Each of the sub-tasks, or functions, is 
performed sequentially. The user is offered several alternatives for each sub-task. Therefore, 
not only is the overall task able to be split into smaller, more manageable components, but 
also each sub-task can be performed in more than one way. 

Figure 1 demonstrates how the edit and imputation task can be split into three sub-tasks.
These three sub-tasks are editing, identification of fields to impute, and imputation. Each
of the boxes, or modules, in a row employs a different approach to that particular sub-task.
For example, C1 could employ some type of donor imputation, C2 could employ the imputation
of a mean value, and so on. The user would select one of the modules from each of
rows A, B, and C.

It should be noted that this representation of a generalized system for edit and imputation
is not the only possibility. In fact, the actual proposal for a developmental project
contains five sub-tasks, as opposed to the three exemplified here. This representation is given
only for simplicity.

Each sub-task, or row in the example, would be a clearly defined function. The input files 
required, and the output files created, must have prespecified formats. This allows the user 
to concentrate on the choice of modules in each row, knowing that the system can handle 
the “housekeeping”. (This refers to file handling and other mundane details about which the 
user would prefer not to worry.) Even though the system may accept all possible combina- 
tions of choices of modules, some combinations may not be desirable or even logically valid. 
It is usually the responsibility of the user to ensure that the pieces fit together. 


¹ Philip Giles and Charles Patrick, Business Survey Methods Division, Statistics Canada, Tunney’s Pasture, Ottawa,
Ontario, Canada, K1A 0T6.

Figure 1. Generalized System Example - Edit and Imputation


A modular approach to the development of a processing system has an important consequence.
From a certain point of view, the system is always “under development”, since additional
modules embodying new approaches, and enhancements to “old” modules, can
always, and in principle should, be added. This open-endedness also means that the very
important concept of prototyping can be easily accommodated. Prototyping is an approach
wherein a subset of modules is developed initially. The system would then be available to
some of the users. Subsequently, additional modules are developed to meet the requirements
of additional users. Thus, the key advantage of prototyping and modularity is that piecemeal
improvements to the system are deliberately anticipated and more easily accomplished. A
minimal, but imperative, requirement of such an approach is that a framework (as shown
in Figure 1) and a host environment (format of data files and programming language) must
be carefully defined and specified very early in the overall developmental process.

In addition to the foregoing developmental advantages, others may be gained after the 
system is in place. The user has considerable flexibility in choosing the path to proceed. If 
several alternatives seem equally viable, one can use historical data to choose among them, 
by testing the various alternatives prior to data collection. This can be accomplished without 
an undue expenditure of effort. Once the generalized system is developed there is a reduc- 
tion in resource requirements for each of its users, with a corresponding reduction in elapsed 
time to implementation. 

There are some disadvantages to following a generalized route. The utilization of generalized
software in a production environment may be less efficient than the corresponding
custom-designed system. The initial resource requirement will be higher for a generalized system as
compared to a customized system. However, this higher cost must be assessed against the
substantially higher costs of repeated custom-designed implementations. Nor is it reasonable
to expect a generalized system to satisfy every specific requirement. In this situation, the
user has two options. The first option is to develop a user-written module. This would not
require the same degree of effort as a complete customization. However, if this occurs
frequently, the purpose of the generalized system is defeated. The second option is for the user
to modify the specifications in order to fit the generalized system mold. If the system has
been well-designed, any required compromise should not result in a serious deterioration
of data quality. It should also be recognized that compromises to the original specifications
are frequently required during the development of a customized system.


2. BACKGROUND TO IMPUTATION 


The term “imputation”, in this document, refers to a certain class of procedures for
handling non-response. The input is a data captured file. The imputation procedure creates
a file with individually “clean” records; a “clean” record being one which has no missing
values and which satisfies all the specified edits. In order to create a clean record, a value
must be estimated for each missing value.

The edits, specified by the user, are logical constraints on the values that each variable
can assume. The set of edits, as a whole, defines the acceptance region for the data. For
categorical data, an edit is specified as a set of combinations of acceptable data values. The
acceptance region can be represented as a set of lattice points in N-space. For numerical data,
an edit is a linear equality or inequality. The requirement of linearity is not unduly restrictive,
since a non-linear edit can be made linear either by algebraic manipulation or by adding
supplementary variables, which are suitably defined non-linear functions of survey variables.
The acceptance region for numerical data is a set of convex regions in N-space. The reason
that there may be more than one convex region is that conditional edits are possible.
Conditional edits are edits which pertain to only a subset of records. For example, the edits which
are relevant to a particular record may be very different, depending on whether the variable
Sex is recorded as Male or Female.

If one or more edits fail for a particular record, it may not be obvious which variable(s)
is/are in error, and, by implication, to be imputed. For example, a failed edit is A + B < C.
The data record under consideration has data values A = 10, B = 5, C = 12. There are
seven combinations of variables to change which would result in a clean record. These are
A, B, C, A & B, A & C, B & C, and A & B & C. Without any other information or decision
rule, each of these choices is equally valid. The problem of how to decide which variable(s)
to impute will not be discussed in this document. It will be assumed that, for each record,
the variable(s) to impute have been identified. No distinction is made between variables to
impute due to missing values and variables to impute due to edit failures.
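
The count of seven can be checked with a short sketch. It evaluates the edit A + B < C on the record from the text and enumerates every non-empty subset of the three variables. The function names are illustrative assumptions and are not part of any proposed system.

```python
from itertools import combinations

def edit_passes(record):
    """The single edit from the text: a record is acceptable when A + B < C."""
    return record["A"] + record["B"] < record["C"]

def candidate_field_sets(variables):
    """Every non-empty subset of the variables involved in the failed edit."""
    return [set(combo)
            for size in range(1, len(variables) + 1)
            for combo in combinations(variables, size)]

record = {"A": 10, "B": 5, "C": 12}
failed = not edit_passes(record)                 # 10 + 5 < 12 is false, so the edit fails
options = candidate_field_sets(["A", "B", "C"])  # the possible sets of fields to impute
```

Each element of `options` is one way of restoring a clean record; choosing among them is the field-identification problem that this paper sets aside.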


3. PROPOSED IMPUTATION TECHNIQUES 


This section comprises four sub-sections, which define all the proposed imputation techniques.
These are Deterministic Imputation, Donor Imputation, Regression Models, and Other
Imputation Estimators. The use of regression models and the section on other estimators
is restricted to numerical data. The other two sub-sections apply both to numerical and
categorical data.

Almost all imputation techniques can be formulated in a prediction framework, described
by Rubin (1976), as follows. A joint distribution, $f(X_1, \ldots, X_N)$, summarizing the
statistical behavior of the population of complete records is specified. This can be done whether
the individual variables are quantitative or qualitative. Without loss of generality, for a record
$i$ which requires imputation, the $N$ variables can be partitioned into $X_1, \ldots, X_{m_i}$, which
require imputation, and $X_{m_i+1}, \ldots, X_N$, which do not require imputation. A conditional
distribution $f(X_1, \ldots, X_{m_i} \mid X_{m_i+1}, \ldots, X_N)$ can be derived. Imputed values, $y_1, \ldots, y_{m_i}$, are
chosen for $X_1, \ldots, X_{m_i}$ from the set


$$\{(y_1, \ldots, y_{m_i}) : f(y_1, \ldots, y_{m_i} \mid X_{m_i+1}, \ldots, X_N) > 0\}.$$


Various selection mechanisms can be employed. However, as stated above, some of these 
are relevant only to certain types of data variables. 

It should be noted that there is nothing new or radically different in these proposals. They 
are based on work done previously, both in Statistics Canada and outside. The discussion 
on donor imputation is based on Fellegi and Holt (1976). The model-based approach to deter- 
mining a value to impute is discussed by Little (1982). Other related papers of interest are 
Sande (1976), Kalton and Kasprzyk (1982), and Kalton and Kish (1981). 


3.1 Deterministic Imputation 


The first type of imputation is called deterministic imputation. This occurs when only 
one value can satisfy the edits. If more than one variable is to be imputed for a particular 
record, a deterministic solution may be possible for some, or all, variables. The check for 
determinacy should be done before proceeding to other imputation procedures. 

Deterministic imputation may arise in very simple and easily detectable situations. For
example, suppose that there is an edit A + B = 10. The record under consideration requires
A to be imputed and B has value 6. Obviously, A = 4 is the only value which will satisfy
the edit. Another example demonstrates this for categorical variables. Suppose an edit is
stated as “If the relationship to the household reference person is wife, then sex must be
female.” If the reference record has “wife” as the value of “relationship to the household
reference person”, and the variable “Sex” requires imputation, then the only valid imputed
value is Sex = Female.

However, a typical survey situation will have several edits, rather than just one. This may
mean that an existing deterministic solution may not be apparent. The procedure for checking
for deterministic imputation is to find the reduced acceptance region defined by the active
edits and the “good” data values. The active edits are defined as the subset of edits
in which the variable(s) to be imputed participate. This can also be expressed in the notation
of the prediction framework given at the beginning of Section 3. The conditional distribution
$f(X_1, \ldots, X_{m_i} \mid X_{m_i+1}, \ldots, X_N)$ will specify a unique value for some or all of the
variables $X_1, \ldots, X_{m_i}$.

An example serves to illustrate the procedure for identifying deterministic imputation. 
Note that while the example is written with numerical variables, an analogous situation ex- 
ists for categorical variables. 

There are three edits: 


XetkuY otul6; 


b Ar A 


lA 
- 


PGs eA 


lA 
ore 




The reference record has values 


x Shbl andivi=3: 
The variable Z is to be imputed. 


It is not apparent whether or not a determinacy exists. The first step is to consider all
active edits. In the example, there are two edits which contain the variable Z.

Next, the known values of X and Y are inserted into these edits, and the reduced acceptance
region is determined.

Solving these inequalities gives the following solution.


It is now obvious that Z = 1 is the only possible valid imputed value. 

In most “real-life” situations, the incidence of deterministic imputation should be low.
The contrary would indicate that the edits are more restrictive than necessary or desirable,
and should lead to a re-examination of the edit specifications. However, in the sense that
it reduces the imputation problem, deterministic imputation is a useful first step.
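
The check for determinacy can be sketched for the case of a single numerical variable and linear inequality edits. The three edits and the record values used below are hypothetical and were chosen only for illustration; they are not claimed to be the edits of the example above. The sketch assumes integer coefficients and that every other variable in an active edit has a known value.

```python
from fractions import Fraction
import math

def reduced_interval(edits, known, target):
    """Bounds on `target` implied by the active edits once known values are inserted.

    Each edit is a pair (coeffs, rhs) meaning sum(coeffs[v] * v) <= rhs.
    """
    lo, hi = -math.inf, math.inf
    for coeffs, rhs in edits:
        c = coeffs.get(target, 0)
        if c == 0:
            continue                       # not an active edit for this variable
        rest = sum(k * known[v] for v, k in coeffs.items() if v != target)
        bound = Fraction(rhs - rest, c)
        if c > 0:
            hi = min(hi, bound)            # c * target <= rhs - rest gives an upper bound
        else:
            lo = max(lo, bound)            # dividing by a negative c gives a lower bound
    return lo, hi

# Hypothetical edits:  X + Y <= 16,  X - Z <= 10,  Y + Z <= 4
edits = [({"X": 1, "Y": 1}, 16), ({"X": 1, "Z": -1}, 10), ({"Y": 1, "Z": 1}, 4)]
lo, hi = reduced_interval(edits, {"X": 11, "Y": 3}, "Z")
deterministic = lo == hi                   # the reduced acceptance region is a single point
```

With these illustrative values the two active edits reduce to Z >= 1 and Z <= 1, so the interval collapses to a single point and the imputation is deterministic.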


3.2 Donor Imputation 


Donor imputation is a method which pairs each record requiring imputation, the can- 
didate record, with one record from a defined donor population. In order to determine the 
value to impute, one approach is to directly copy the value from the donor record onto the 
candidate record. For numerical variables, if suitable auxiliary information is available, more 
complex methods may be used to determine the value to be imputed. Further discussion on 
imputation estimators for donor imputation is given in Section 3.3. 

Usually, the donor population is defined as all records in the current survey which have
no variables to be imputed. Referring to the prediction framework described at the beginning
of Section 3, this situation implies that $f(X_1, \ldots, X_N)$ is the empirical probability
function. However, other approaches to defining the donor population are possible. For the
remainder of the discussion on donor imputation, it will simply be assumed that a donor
population has been defined.

Donor-candidate pairs are formed using matching variables. Matching variables are defined
as variables which do not require imputation on the candidate record and are “highly
correlated” with the variable(s) requiring imputation. Preferably, the matching variables should
also have “low correlation” with each other. Two matching variables with “high correlation”
would have the same discriminatory power as one alone, but would have the effect
of doubling the weight given to one alone.

For categorical variables, a donor record is chosen, using some random process, from
amongst potential donor records having the same values for the matching variables as those
for the candidate record. Since numerical variables can assume many more values than
categorical variables, it is very unlikely that an exact match on matching variables would
be possible. Therefore, for numerical data, a distance function is used to define similarity.
This distance function is a function of the matching variables on the candidate and potential
donor records. The chosen donor is the record with minimum distance from the candidate
record. Usually, the matching variables are transformed for the purpose of distance calculations
in order to remove the effect of the scale in which the variable is recorded. For example,
it would be quite worrisome to the user if the formation of the donor-candidate pairs was
dependent on whether a length variable was recorded in metres or feet. The proposed
transformations and distance functions are discussed below.
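
For the categorical case, the selection of a donor can be sketched as follows. The record layout (dictionaries keyed by variable name), the variable names, and the function name are assumptions made for illustration only.

```python
import random

def pick_categorical_donor(candidate, donors, matching_vars, rng=random):
    """Randomly pick a donor that agrees with the candidate on every matching variable."""
    key = tuple(candidate[v] for v in matching_vars)
    pool = [d for d in donors if tuple(d[v] for v in matching_vars) == key]
    return rng.choice(pool) if pool else None    # None signals that no exact match exists

donors = [
    {"relationship": "wife", "age_group": "30-39", "sex": "F"},
    {"relationship": "son", "age_group": "10-19", "sex": "M"},
]
candidate = {"relationship": "wife", "age_group": "30-39", "sex": None}

donor = pick_categorical_donor(candidate, donors, ["relationship", "age_group"])
imputed = dict(candidate, sex=donor["sex"])      # direct transcription of the donor value
```

When no donor matches exactly, the sketch returns `None`; what to do in that case is left to the caller.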

The matching variables to be used can be a user input, or determined by an automated
procedure. Usually, due to time considerations, all decisions must be made prior to data
collection. Therefore, if the determination of matching variables is a user input, the user
must specify the matching variables for each pattern of variables to be imputed. If there
are N variables on the file, the user must make $2^N - 2$ input specifications (1,022 when
N = 10, for example). Obviously, the value of N does not have to be very large in order for
this approach to become unmanageable. In order to reduce this number, the matching variables
may be specified by stratum. All candidate records in a particular stratum would use the same
matching variables. In this situation, it is possible (depending on how careful the user is in
specifying the matching variables) that a particular candidate record may have a matching
variable which requires imputation. All in all, the user who inputs the matching variable
specifications is warned that this decision may result in a large increase in the work required.

One possible approach for automatically determining the matching variables is proposed.
This procedure can be used, analogously, for both categorical and numerical data. Basically,
the procedure is as follows. At a minimum, the set of matching variables must contain
the variables sharing in the edit rules with the variables to be imputed. As defined earlier,
these are the active edits. This approach seems intuitively reasonable, since it is desirable
that the matching variables be correlated with the variable(s) to be imputed. The variables
in the active edits constrain the range of possible values to be imputed. This implies a type
of dependence, or correlation structure.

The use of this matching procedure, together with direct transcription, has one important 
consequence for categorical variables. All imputed values are guaranteed to pass the edits. 
This is very important as it is required in order to create a clean record. Without this guarantee, 
the user must re-edit the records, and possibly adopt a secondary imputation procedure. For 
numerical data, similarity as defined by a distance function does not guarantee this outcome. 
However, the closer the distance between the donor and candidate record is to zero, the greater 
the probability that the imputed values will satisfy the edits. 

The determination of matching variables using this automated procedure can be illustrated 
by an example. 

There are five edits:

There are five survey variables A, B, C, D, E, and $a_1, a_2, a_3, a_4, a_5$ are known scalars.

The candidate record under consideration has variable B only to be imputed. 

The first step is to identify the active edits. In this example, there are three active edits. 
These are edits I, II, and V. 

The second step is to determine the active variables. The active variables are defined as 
all variables which are contained in at least one of the active edits. In the example, there 
are four active variables: A, B, C, E. Note that, by definition, the active variables contain 
all variables to be imputed. 

The third step is to determine the matching variables, as those active variables which do
not require imputation. For this example, the matching variables are A, C, E.
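
The three steps can be sketched in a few lines. Because only the list of variables appearing in each edit matters for this procedure, each edit is represented by a set of variable names. The five sets below are hypothetical; they were chosen so that edits I, II, and V are active when B is to be imputed, in agreement with the example, and are not claimed to be the edits of the original example.

```python
def matching_variables(edits, to_impute):
    """Return (active edits, active variables, matching variables)."""
    to_impute = set(to_impute)
    # Step 1: an edit is active if it involves at least one variable to be imputed.
    active_edits = {name: vars_ for name, vars_ in edits.items() if vars_ & to_impute}
    # Step 2: the active variables are all variables appearing in an active edit.
    active_vars = set().union(*active_edits.values()) if active_edits else set()
    # Step 3: the matching variables are the active variables not requiring imputation.
    return sorted(active_edits), active_vars, active_vars - to_impute

# Hypothetical variable lists for edits I to V.
edits = {
    "I": {"A", "B"},
    "II": {"B", "C"},
    "III": {"A", "D"},
    "IV": {"C", "D", "E"},
    "V": {"B", "E"},
}
active, active_vars, matching = matching_variables(edits, {"B"})
```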

In addition to the determination of matching variables, donor imputation for numerical 
data requires the choice of a data transformation and the choice of a distance function. 

Two types of data transforms are proposed. For both of these, each variable is to be 
transformed independently. The two proposed transformations are a rank value transform 
and a location-scale transform. 

For the rank value transform, the values for each variable are sorted. Then, the rank values 
are divided by a suitable constant such that all values are in the range from zero to one. 
The transformed values are distributed uniformly in that range. 

The location-scale transform is of the form

$$y' = \frac{y - a}{b},$$

where $y'$ is the transformed value,
      $y$ is the original data value,
      $a$, $b$ are user-specified parameters.


Two popular choices for these constants are, one, that a be the sample mean and b be 
the sample standard deviation, and, two, that a be the sample minimum and b be the range 
of values in the sample. Other options may be possible. 

In choosing a data transform, there are robustness and outlier considerations. The rank 
value transform is very robust against changes in data values, and pulls outliers closer to 
the other data values. This may or may not be desirable. There are no bounds on the 
transformed values, using the location-scale transform with the mean and standard devia- 
tion. These parameters are also sensitive to outliers. The choice of the minimum value and 
range would restrict the transformed values between zero and one. However, these are very 
sensitive to extreme values. One very large value could cause all of the transformed values, 
except one, to be virtually zero. 
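
The behaviour of the two transforms in the presence of an extreme value can be illustrated with a small sketch. The treatment of ties (average rank) and the divisor n for the rank value transform are assumptions; the text requires only that the ranks be divided by a suitable constant. The location-scale form $y' = (y - a)/b$ is used with the minimum and the range as parameters.

```python
def rank_transform(values):
    """Ranks (ties share the average rank) divided by n, so results lie in (0, 1]."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1           # average 1-based rank of the tied block
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return [r / n for r in ranks]

def location_scale(values, a, b):
    """Location-scale transform y' = (y - a) / b applied to every value."""
    return [(y - a) / b for y in values]

data = [1.0, 2.0, 3.0, 1000.0]              # one very large value
by_rank = rank_transform(data)
by_range = location_scale(data, min(data), max(data) - min(data))
```

With these data the rank value transform spaces the four records evenly, whereas the minimum and range parameters push the three smaller values to within 0.01 of zero.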

In considering the choice of distance function, a family of distance functions is proposed.
These are the weighted $L^p$ norms, where $p$ is a user-specified constant. The general form
of these functions is

$$D(X, Y) = \left[ \sum_{k=1}^{r} w_k \, |x_k - y_k|^p \right]^{1/p},$$

where $x_k$, $y_k$ are the $r$ matching variables on the two records,
      $w_k$ are user-specified weights,
      $p$ is a user-specified constant.

The weights are used if one wishes some of the matching variables to contribute more
to the distance calculation than others. The default values are for all weights to be set to one. 

Three particular choices of a value for $p$ are of special interest: $p = 1$, $p = 2$, and $p = \infty$.
For $p = 1$, this function calculates the city block distance. For $p = 2$, the Euclidean distance
is calculated. The limiting case of this function, when $p = \infty$, yields the minimax distance.
For this choice of $p$, the function is written as

$$D(X, Y) = \max_{1 \le k \le r} \left[ w_k \, |x_k - y_k| \right].$$
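
A minimal sketch of this family of distance functions, and of the choice of the nearest donor, is given below. Records are assumed to be sequences of already-transformed matching variables; the function names are illustrative.

```python
import math

def weighted_distance(x, y, p=2, weights=None):
    """Weighted L^p distance between two records; p = math.inf gives the minimax distance."""
    weights = weights or [1.0] * len(x)      # default: all weights set to one
    gaps = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(w * g for w, g in zip(weights, gaps))
    return sum(w * g ** p for w, g in zip(weights, gaps)) ** (1.0 / p)

def nearest_donor(candidate, donors, p=2, weights=None):
    """Index of the donor at minimum distance from the candidate."""
    return min(range(len(donors)),
               key=lambda i: weighted_distance(candidate, donors[i], p, weights))

city_block = weighted_distance((0, 0), (3, 4), p=1)
euclidean = weighted_distance((0, 0), (3, 4), p=2)
minimax = weighted_distance((0, 0), (3, 4), p=math.inf)
best = nearest_donor((0.10, 0.20), [(0.90, 0.90), (0.15, 0.25)])
```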


One final point to be discussed about donor imputation is the concept of a “penalty”
for donor usage. This penalty would reduce the number of times that a particular donor
record is used. For donor imputation of categorical data, a donor record is selected from
the donor population without replacement. This strategy has to be modified slightly if the
size of the candidate population is greater than the size of the donor population.

For numerical data, the distance function is modified by increasing the distance calculation
according to the number of times a particular donor is used. One possible approach
is to use $D'(X, Y)$ to calculate distances, where

$$D'(X, Y) = D(X, Y) \times (1 + \mu d),$$

where $\mu$ is the “penalty” imposed by the user,
      $d$ is the number of times that donor record has been chosen.


An implication of the imposition of a penalty on the distance function, is that the choice 
of a donor record for each candidate record is now dependent on the order of the candidate 
records. 
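
The order dependence can be seen in a small sketch, which assumes the penalized distance takes the multiplicative form $D(X, Y)(1 + \mu d)$. The data, the city block distance, and the function names are illustrative assumptions.

```python
def city_block(x, y):
    """Unweighted city block distance between two records."""
    return sum(abs(a - b) for a, b in zip(x, y))

def assign_donors(candidates, donors, distance, mu=0.5):
    """Assign a donor index to each candidate in turn, penalizing donors already used."""
    uses = [0] * len(donors)
    chosen = []
    for cand in candidates:
        best = min(range(len(donors)),
                   key=lambda j: distance(cand, donors[j]) * (1 + mu * uses[j]))
        uses[best] += 1                      # this donor becomes more "distant" next time
        chosen.append(best)
    return chosen

donors = [(0.0,), (1.0,)]
order_1 = [(0.40,), (0.45,)]
order_2 = [(0.45,), (0.40,)]
result_1 = dict(zip(order_1, assign_donors(order_1, donors, city_block, mu=1.0)))
result_2 = dict(zip(order_2, assign_donors(order_2, donors, city_block, mu=1.0)))
```

The candidate with value 0.40 receives donor 0 when it is processed first and donor 1 when it is processed second, although neither the data nor the penalty has changed.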


3.3 Regression Models 


This section discusses imputation estimators which result from the use of regression models. 
For this discussion, only two models are used. These are: 


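MODEL I:  $y_i = \beta + \varepsilon_i$,  $\mathrm{Var}(\varepsilon_i) = \sigma^2$,

MODEL II:  $y_i = \beta x_i + \varepsilon_i$,  $\mathrm{Var}(\varepsilon_i) = \sigma^2 x_i$.


Note that these models are special cases of the more general formulation of regression
models, which has the form

$$Y = X\beta + \varepsilon,$$

where $E(\varepsilon) = 0$, $V(\varepsilon) = V$.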


Model II is used when auxiliary data are available. Otherwise Model I is used. Both models
have one parameter to be estimated. Using least squares (weighted by the inverse of the
variance in the case of Model II), the parameter estimates are:

MODEL I:  $\hat{\beta} = \bar{y} = \frac{1}{n} \sum_i y_i$,

MODEL II:  $\hat{\beta} = \frac{\sum_i y_i}{\sum_i x_i} = \frac{\bar{y}}{\bar{x}}$,

where the sums and means are taken over the $n$ records used to fit the model.

Before stating the various proposed estimators, some notation will be introduced. Let

$t$ be the subscript for time $t$, the present survey,

$y_{it}$ be the variable under study for unit $i$ and time $t$; this is the value to be imputed
for candidate records,

$x_{it}$ be the auxiliary variable (correlated with $Y$) for unit $i$ and time $t$,

$R$ be the subscript for all respondents at time $t$ (i.e., $y_{it}$ is known),

$NR$ be the subscript for all non-respondents at time $t$ (i.e., $y_{it}$ is to be imputed),

$C$, $D$ be superscripts which denote either a candidate or donor record, whenever the
distinction is required.


Several explanatory notes are required along with the notation. First, R and NR are as
defined in the current survey, regardless of the reporting history of each record. Second,
the values for the variables $y_{i(t-1)}$, $x_{it}$, $x_{i(t-1)}$ may themselves have been imputed. The only
restriction is that they are not missing. Third, the notation does not include the concept of
imputation classes. Imputation classes are essentially post-strata, in that they define sets of
records which are judged homogeneous within, and heterogeneous between groups. However,
both the notation and the imputation estimators are readily extendible to include imputation
classes.

Thus, estimators can be classified according to: 

(i) the choice of model, I or II, 

(ii) the imputation group, and, 

(iii) the variables in the regression used to estimate the parameter. 

The data on the records in the specified imputation group are precisely the data used to
estimate the parameter(s) in the model. This concept allows considerable flexibility. For example,
it could allow the exclusion of outliers from the calculation of the parameter estimate.
After the parameter is estimated, it is used for prediction purposes to determine the imputed
value. According to the notation, $Y_t$ is always the variable predicted.

Based on the two models, eight imputation estimators are proposed. Even though there 
are eight proposed estimators, this list can be augmented in the future. These additional 
estimators could be derived, for example, by choosing other models, possibly incorporating 
more variables. 

Scanning the list of eight, one can see that these are the familiar imputation estimators 
that have been used traditionally. 


Estimator 1: The value from the previous survey for the same unit is imputed.

$$y_{i(t-1)}$$

Estimator 2: The mean value from the previous survey is imputed.

$$\bar{y}_{(t-1)}$$

Estimator 3: The mean value of all respondents to the current survey is imputed.

$$\bar{y}_{Rt}$$

Estimator 4: The value is copied directly from the donor record to the candidate record.

$$y_{it}^{D}$$

Estimator 5: A ratio estimate, using values from the current survey, is imputed.

$$\frac{\bar{y}_{Rt}}{\bar{x}_{Rt}} \, x_{it}$$

Estimator 6: A ratio estimate, based on values on the donor and candidate records, is
imputed.

$$\frac{y_{it}^{D}}{x_{it}^{D}} \, x_{it}^{C}$$

Estimator 7: The value from the previous survey for the same unit, with a trend adjustment
calculated from an auxiliary variable, is imputed.

$$\frac{x_{it}}{x_{i(t-1)}} \, y_{i(t-1)}$$

Estimator 8: The value from the previous survey for the same unit, with a trend adjustment
calculated from the change in reported values of variable Y, is imputed.

$$\frac{\bar{y}_{Rt}}{\bar{y}_{R(t-1)}} \, y_{i(t-1)}$$


It is interesting to contrast the difference in estimators when one fixes all classification 
items but one. For example, the difference between estimators one and two is due only to 
the difference in choice of imputation group, as is also the case for estimators three and 
four, and, estimators five and six. The difference between estimators one and seven is due 
only to the choice of model. The same is true for estimators three and five, and, estimators 
four and six. It should also be noted that estimators four and six are those used in donor 
imputation, which were discussed in Section 3.2. 
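
Several of these estimators can be written as one-line functions, which makes the role of the model (mean versus ratio) and of the imputation group explicit. The sketch assumes the ratio forms implied by Model II, namely an estimated slope equal to a ratio of means or of single values, applied to the auxiliary value of the record being imputed. The function and argument names are illustrative.

```python
from statistics import mean

def respondent_mean(y_resp):                          # Estimator 3: Model I, current respondents
    return mean(y_resp)

def ratio_current(y_resp, x_resp, x_i):               # Estimator 5: Model II, current respondents
    return mean(y_resp) / mean(x_resp) * x_i

def donor_ratio(y_donor, x_donor, x_cand):            # Estimator 6: Model II, the donor record
    return y_donor / x_donor * x_cand

def trend_auxiliary(y_prev_i, x_i, x_prev_i):         # Estimator 7: same unit, auxiliary trend
    return y_prev_i * x_i / x_prev_i

def trend_respondents(y_prev_i, y_resp, y_resp_prev): # Estimator 8: same unit, respondent trend
    return y_prev_i * mean(y_resp) / mean(y_resp_prev)

y_resp, x_resp = [10, 20, 30], [5, 10, 15]
est_3 = respondent_mean(y_resp)
est_5 = ratio_current(y_resp, x_resp, x_i=8)
est_6 = donor_ratio(y_donor=30, x_donor=15, x_cand=8)
est_7 = trend_auxiliary(y_prev_i=50, x_i=12, x_prev_i=10)
est_8 = trend_respondents(y_prev_i=50, y_resp=[22, 22], y_resp_prev=[20, 20])
```

Estimators 5 and 6 differ only in the records used to estimate the slope, which is the distinction by imputation group drawn in the text.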


3.4 Other Imputation Estimators 


The choice of imputation techniques is dependent upon the assumptions made by the user
about the non-responding population. When using donor imputation, one assumes that there
are some respondents which are similar to each non-respondent. If one imputes the mean
from the current survey, the assumption is that the mean value of the respondents is the
same as the mean value of the non-respondents. Similarly, one can go through all the
estimators and list the implied assumptions. The first estimator proposed in this section tries
to ease the somewhat restrictive (and usually untrue) assumptions required in the previous
section. It pays for this by being more complex. It is called the chain-link estimator, given
by Madow and Madow (1978).

The derivation of this estimator is described. First, by assuming that the rate of change
(trend) from the previous survey is the same for the non-responding and responding
populations, the population mean of the variable Y for the non-responding population
in the current survey is estimated.

$$\hat{\bar{y}}_{NRt} = \frac{\bar{y}_{NR(t-1)}}{\bar{y}_{R(t-1)}} \, \bar{y}_{Rt}$$

One then determines the imputed value according to the auxiliary variable.

$$\hat{y}_{it} = \frac{\hat{\bar{y}}_{NRt}}{\bar{x}_{NRt}} \, x_{it}
= \frac{\bar{y}_{NR(t-1)} \, \bar{y}_{Rt}}{\bar{x}_{NRt} \, \bar{y}_{R(t-1)}} \, x_{it}$$
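
The two steps of the derivation can be sketched as a single function. The argument names are assumptions; each mean is taken over the group and survey occasion indicated in its name.

```python
def chain_link(x_i, y_nr_prev_mean, y_r_mean, y_r_prev_mean, x_nr_mean):
    """Chain-link imputed value for a current non-respondent with auxiliary value x_i."""
    # Step 1: carry the non-respondents' previous mean forward using the respondents' trend.
    y_nr_mean_hat = y_nr_prev_mean * y_r_mean / y_r_prev_mean
    # Step 2: distribute that estimated mean over units in proportion to the auxiliary variable.
    return y_nr_mean_hat / x_nr_mean * x_i

imputed = chain_link(x_i=10, y_nr_prev_mean=40, y_r_mean=55, y_r_prev_mean=50, x_nr_mean=11)
```

With these illustrative figures the respondents' mean grew by 10 percent, so the non-respondents' mean is carried forward from 40 to 44, and a unit with auxiliary value 10 (against a group mean of 11) receives 40.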


Note that this amounts to a more complex application of the Regression Model approach
discussed in Section 3.3. First, temporarily impute $y_{it} = \hat{\bar{y}}_{NRt}$, as given above. Then, use
Model II, and define the imputation group as being all non-responding records to the present
survey for variable Y. The response variable is $Y_t$. The regressor variable is $X_t$. The
resulting estimator is as given above.

The second estimator proposed in this section can be used when one has data on variable
Y for several previous surveys. It does not use auxiliary variables, or data from other records. 
The behavior of each non-respondent is considered independently of others. This method 
is called exponential smoothing. It is a standard econometric forecasting technique. There 
is one user-specified parameter. It allows the flexibility of changing the relative contribution 
of the various data values. Algebraically, the estimator is given by 


$$\hat{y}_{it} = \frac{1 - \lambda}{1 - \lambda^{t}} \sum_{k=1}^{t} \lambda^{k-1} \, y_{i(t-k)},$$


where $0 < \lambda < 1$ is prespecified.


The closer $\lambda$ is to zero, the more weight is given to recent data. If $t = 1$, this reduces
to imputing the value for the previous survey.
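
A minimal sketch of exponential smoothing for a single unit follows. It assumes geometrically declining weights on the unit's previous values, normalized so that the weights sum to one; this form is consistent with the two properties stated above but is an assumption of the sketch. The function name and the ordering of the history are also illustrative.

```python
def exponential_smoothing(history, lam):
    """Smoothed value from one unit's past values; history[0] is the most recent."""
    if not 0 < lam < 1:
        raise ValueError("the smoothing parameter must lie strictly between 0 and 1")
    n = len(history)
    scale = (1 - lam) / (1 - lam ** n)       # normalizes the n geometric weights to sum to one
    return scale * sum(lam ** k * y for k, y in enumerate(history))

one_period = exponential_smoothing([7.0], 0.3)           # only the previous survey is available
recent_heavy = exponential_smoothing([10.0, 20.0], 0.1)  # close to the most recent value
more_even = exponential_smoothing([10.0, 20.0], 0.9)     # close to the simple average
```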


4. PAST WORK IN STATISTICS CANADA 


Statistics Canada has made efforts in the past to develop a generalized edit and imputa- 
tion system. Two of these will be highlighted, as they form the basis for the current pro- 
posal. These are the CAN-EDIT system and the Numerical Edit and Imputation System 
(NEIS). 


4.1 CAN-EDIT 


CAN-EDIT is not itself a completely generalized system. However, the methodology that
it employed is. The system is based on the work by Fellegi and Holt (1976) on imputation
for categorical data. It was developed for processing the 1976 and 1981 Canadian Censuses
of Population and Housing.

CAN-EDIT adopted a donor imputation approach. The matching variables were deter- 
mined automatically, using the procedure described in Section 3.2. The CAN-EDIT system 
employed what it called primary and secondary imputation. If a candidate record could not 
be imputed in primary imputation, it was sent to secondary imputation. 

In primary imputation, all imputed values are taken from the same donor. The matching 
variables were determined based on all variables to be imputed. A record would fail primary 
imputation if no donor record had identical values on the matching variables. 

In secondary imputation, each of the variables to be imputed is treated independently
and sequentially. The procedure for determining the matching variables is the same. However,
by considering only one variable at a time, the number of matching variables will, in general,
be smaller than under primary imputation. (There cannot be more, but the number may be the
same.) This implies that the potential donor population is larger. There are a few disadvantages
to secondary imputation, as compared to primary imputation. First, it is possible to
choose, as a matching variable, a variable which is itself to be imputed, so that there is no
value to match on. Second, this approach does not make use of the joint distributions of the
variables. The imputed values for two variables may satisfy the edits, and each may be a
perfectly valid value, yet the combination may occur only rarely in the population.


4.2 Numerical Edit and Imputation System (NEIS) 


The NEIS is a first prototype of a generalized E&I system for numerical data. It was written
as a set of modules in the PSTAT statistical package. Subsequent prototypes have never
been developed. This system was developed by Gordon Sande (1979). It is felt that the
methodology is very sound, and should be incorporated in a new system. However, PSTAT
may no longer be a suitable software environment. The NEIS was used, in a production
environment, by the 1981 Farm Energy Use Survey. The methodology was employed in the
development of the 1981 Census of Agriculture processing system.

The NEIS, similar to CAN-EDIT, used a donor imputation approach with matching 
variables determined automatically using the procedure described in Section 3.2. However, 
as explained in that section, the determination of matching variables in this fashion for 
numerical data will not always result in the imputation procedure producing a clean record. 
The strategy adopted to reduce this problem is to select the closest r donors. If the closest 
donor does not impute values which satisfy the edits, then the next closest donor is con- 
sidered, and so on. 
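
The strategy of trying the closest r donors in turn can be sketched as follows. This is an illustration of the idea only and is not the NEIS code; the record layout, the distance function, and the edit check are hypothetical.

```python
def impute_from_closest(candidate, donors, fields, distance, passes_edits, r=3):
    """Try the r closest donors in order; return the first imputed record that passes the edits."""
    ranked = sorted(donors, key=lambda d: distance(candidate, d))[:r]
    for donor in ranked:
        trial = dict(candidate, **{f: donor[f] for f in fields})
        if passes_edits(trial):
            return trial
    return None                              # none of the r closest donors gave a clean record

# Hypothetical set-up: A is the matching variable, B is to be imputed, and the edit is A + B <= 10.
donors = [{"A": 6.5, "B": 5}, {"A": 5.0, "B": 4}, {"A": 1.0, "B": 9}]
candidate = {"A": 6.0, "B": None}
by_a = lambda c, d: abs(c["A"] - d["A"])
edit = lambda rec: rec["A"] + rec["B"] <= 10

clean = impute_from_closest(candidate, donors, ["B"], by_a, edit, r=3)
only_closest = impute_from_closest(candidate, donors, ["B"], by_a, edit, r=1)
```

Here the closest donor would give A + B = 11, which fails the edit, so the value is taken from the second closest donor. With r = 1 no clean record is produced.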

The NEIS gave the user no choice of transformation or distance function. It used the
rank value transformation and the weighted $L^\infty$ norm for distance calculations.


5. CONCLUSION 


The proposals presented would allow considerable choice to a user of a generalized edit 
and imputation system. As mentioned, it does not close the door on additional approaches. 
However, it is felt that a system which is developed with these components would be suitable 
for a large number of users. It has been the experience of the authors that the ultimate
power and usefulness of such a system is not apparent until one starts to use it. As testing 
proceeds, it becomes clear that there are more capabilities and extensions than first appear. 


The Maximum Likelihood Method for
Non-Response in Sample Surveys


ABSTRACT


The analysis of survey data becomes difficult in the presence of incomplete responses. By the use of 
the maximum likelihood method, estimators for the parameters of interest and test statistics can be 
generated. In this paper the maximum likelihood estimators are given for the case where the data is 
considered missing at random. A method for imputing the missing values is considered along with the 
problem of estimating the change points in the mean. Possible extensions of the results to structured 
covariances and to non-randomly incomplete data are also proposed. 


KEY WORDS: Incomplete response; Missing at random; Maximum likelihood method; Imputation. 


1. INTRODUCTION 


Examples of non-response in sample surveys are in abundance. Various attempts with vary- 
ing degrees of success have been made in the literature to solve this problem. The success 
of a particular procedure is dependent on the complexity of the problem. For example, when 
the data is not missing at random, the problem is far from being solved. The recent attempts 
by Heckman (1976) and Greenlees et al. (1982) among others, are highly sensitive to model
misspecification. Similarly the hot-deck method has been severely criticized in the literature. 
However, when the sample size is large, the hot-deck method and a carefully designed regres- 
sion method yield similar results in imputing the non-response income in Current Popula- 
tion Survey (CPS). See David, Little, Samuhel and Triest (1986). 

The regression method is based on the assumption that the non-response is random, but 
unlike the hot-deck method does not require complete information from a previous census, 
which in a majority of cases is non-existent. Thus it appears that a carefully designed regres- 
sion method may be of great help. 

In this paper, the situation when the non-response is random is considered. Random non- 
response arises naturally in many situations. For example, in successive sampling, the sampling 
starts with a certain number of people from whom certain observations are obtained for a 
period of time. At the end of this period, some people are dropped from the survey and 
new people are added. The survey continues in this manner until completion. Examples of 
this nature are considered by Woolson, Leeper and Clarke (1978) and Woolson and Leeper 
(1980). 

Even when the non-response is not random, the non-random nature of the incomplete
data may be accounted for, by using a sufficient number of explanatory variables in the regres- 
sion model and employing some of the techniques used in the hot-deck method as was done 
in David et al. (1986) for a univariate model. For example, in Section 2.5 a method for im- 
puting the missing values is given. 


1 M.S. Srivastava, Department of Statistics, University of Toronto, Toronto, Ontario, Canada M5S 1A1, and E.M.
Carter, Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario, Canada N1G 2W1.

In the course of developing these results, a method will be derived for checking if there
have been any changes over time in the response patterns. The models used can also be
modified to include error variance-covariance matrices that are structured by the imposition
of a time series to the response variables. In this paper it is assumed that the data are normally
distributed from a simple random sampling scheme and that the data are missing at random.
If the normality assumption is dropped then the estimators can no longer be considered
maximum likelihood estimators but may still be considered as good heuristic estimators.

In the next section, the form of the model will be described for the one sample problem. 


2. THE ONE SAMPLE PROBLEM 


2.1 The Model 


The bivariate incomplete data problem is considered first to introduce the general procedure
that follows. Let $y = (y_1, y_2)'$ be a bivariate random vector with mean vector $\mu$ and
covariance matrix $\Sigma$. Without loss of generality, the missing data in the bivariate situation
can be described as follows:


$$\begin{array}{lllllllll}
y_{1,1}, & \ldots, & y_{1,n_1}, & y_{1,n_1+1}, & \ldots, & y_{1,n_1+n_2}, & -, & \ldots, & - \\
y_{2,1}, & \ldots, & y_{2,n_1}, & -, & \ldots, & -, & y_{2,n_1+n_2+1}, & \ldots, & y_{2,n_1+n_2+n_3}
\end{array} \qquad (1)$$


That is, there are $n_1$ pairs of observations, $n_2$ observations on $y_1$ with the corresponding
observation on $y_2$ missing, and $n_3$ observations on $y_2$ with the corresponding observation
on $y_1$ missing. Thus $N = n_1 + n_2 + n_3$ observations are grouped into three subsets. If the
complete data set were to be represented as $y_1, \ldots, y_N$, then the actual observed responses
can be defined as


$$z_{1j} = B_1 y_j = y_j, \quad \text{for } j = 1, \ldots, n_1,$$


$$z_{2j} = B_2 y_j = y_{1,j}, \quad \text{for } j = n_1 + 1, \ldots, n_1 + n_2,$$


and


$$z_{3j} = B_3 y_j = y_{2,j}, \quad \text{for } j = n_1 + n_2 + 1, \ldots, n_1 + n_2 + n_3,$$


where $B_1 = I$, the identity matrix, $B_2 = (1 \;\; 0)$ and $B_3 = (0 \;\; 1)$.

For the general multivariate one sample problem, there will be $K$ subsets of the data containing
$n_1, \ldots, n_K$ observations. Note that the maximum number of groups is $2^p - 1$. Also the
total sample size is $N = n_1 + \ldots + n_K$. If the $k$-th subset contains $p_k$ characteristics
$i_1, \ldots, i_{p_k}$, then the matrix $B_k$ would be a $p_k \times p$ matrix with a one in the $(s, i_s)$ position
for $s = 1, \ldots, p_k$ and zero elsewhere. With this notation the observed vectors of responses
can be written as:


$$z_{kj} = B_k y_{kj}, \quad j = 1, \ldots, n_k \text{ and } k = 1, \ldots, K.$$


Hence,

$$E(z_{kj}) = B_k \mu$$

and

$$\mathrm{cov}(z_{kj}) = B_k \Sigma B_k', \quad j = 1, \ldots, n_k \text{ and } k = 1, \ldots, K.$$
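
The selection matrices $B_k$ can be built directly from the list of observed characteristics.
The sketch below is an illustration only: the helper names `selection_matrix` and `observe`
are hypothetical, and characteristics are indexed from 0 rather than from 1 as in the text.

```python
def selection_matrix(observed, p):
    """Return B_k as a list of rows; row s has a single one in column observed[s]."""
    return [[1 if col == idx else 0 for col in range(p)] for idx in observed]


def observe(B, y):
    """Apply B_k to a complete response vector y, giving the observed part z = B_k y."""
    return [sum(b * v for b, v in zip(row, y)) for row in B]


# Bivariate case of Section 2.1 (0-based indices): B_1 = I, B_2 = (1 0), B_3 = (0 1).
B1 = selection_matrix([0, 1], 2)
B2 = selection_matrix([0], 2)
B3 = selection_matrix([1], 2)
```

With this representation each response pattern is identified by its list of observed indices,
so the K subsets can be formed by grouping records on that list.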


Example 1: (Data)


Wei and Lachin (1984) give the cholesterol levels for a treatment group studied at times 
0, 6, 12, 20 and 24 months. For reasons not pertaining to the response variable, certain obser- 
vations were incomplete. The data can be grouped into K = 8 subsets. For the first group 
of complete data the sample mean and covariance matrix, based on 36 observations, were: 


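$$\bar z_1 = \begin{pmatrix} 226.6 \\ 249.6 \\ 2?2.6 \\ 255.1 \\ 250.7 \end{pmatrix}, \qquad
S_1 = \begin{pmatrix}
1964 & 1301 & 1151 & 960 & 1008 \\
1301 & 1715 & 1109 & 1023 & 1199 \\
1151 & 1109 & 1554 & 697 & 1266 \\
960 & 1023 & 697 & 1148 & 667 \\
1008 & 1199 & 1266 & 667 & 2546
\end{pmatrix}$$


The data for each of the other subsets are given in Table 1 with the imputed values in
parentheses.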


The matrices that define the model for the observed values are:


$B_1 = I_5$,
[$B_2, \ldots, B_6$: entries not legible in the source]
$B_7 = (1 \;\; 0 \;\; 0 \;\; 0 \;\; 0)$, $B_8 = (0 \;\; 1 \;\; 0 \;\; 0 \;\; 0)$.


Now that the model is defined, estimation of the parameters and the imputation of the 
missing data can be performed. 


2.2 Estimation of the Population Mean Vector and Covariance Matrix. 


For each of the $K$ subsets define the sample mean as


$$\bar z_k = n_k^{-1} \sum_{j=1}^{n_k} z_{kj}.$$

Subset 2: $n_2 = 7$
Subset 3: $n_3 = 1$
Subset 4: $n_4 = 12$
Subset 5: $n_5 = 1$
Subset 6: $n_6 = 5$
Subset 7: $n_7 = 2$
Subset 8: $n_8 = 1$


Table 1


Observed Cholesterol Levels and Imputed Values 


Variable 


193 


201 
202 
209 
nA2 
276 
163 
239 
204 
247 
195 
228 
290 


221 


250 
r75 
260 
197 
248 


193 
256 


(284) 


2 


260 
(215) 


(327) 
(250) 
(327) 
(235) 
(286) 


(219) 
(294) 


(287) 


a 


Note: Total sample size is N = 65. 

Then

$$E(\bar z_k) = B_k \mu, \qquad \mathrm{cov}(\bar z_k) = n_k^{-1} (B_k \Sigma B_k'),$$


and the $\bar z_k$ are independently distributed for $k = 1, \ldots, K$. Applying the least squares theory,
we minimize


$$\sum_{k=1}^{K} \mathrm{tr}\, n_k (B_k \Sigma B_k')^{-1} [\bar z_k - B_k \mu][\bar z_k - B_k \mu]'.$$

The solution for a given value of $\Sigma$ is


$$\hat\mu = \Big[\sum_{k=1}^{K} n_k B_k' (B_k \Sigma B_k')^{-1} B_k\Big]^{-1} \sum_{k=1}^{K} n_k B_k' (B_k \Sigma B_k')^{-1} \bar z_k. \qquad (2)$$
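
For a fixed $\Sigma$, equation (2) is a weighted least squares solve. A minimal sketch follows,
assuming NumPy is available; the function name `gls_mean` and the input layout (one tuple of
$B_k$, $\bar z_k$ and $n_k$ per subset) are choices made for this example only.

```python
import numpy as np


def gls_mean(groups, sigma):
    """Solve equation (2) for mu given sigma.

    groups: iterable of (B_k, zbar_k, n_k), with B_k of shape (p_k, p)
            and zbar_k the subset mean of length p_k.
    """
    p = sigma.shape[0]
    lhs = np.zeros((p, p))
    rhs = np.zeros(p)
    for B, zbar, n in groups:
        W = np.linalg.inv(B @ sigma @ B.T)   # (B_k Sigma B_k')^{-1}
        lhs += n * B.T @ W @ B
        rhs += n * B.T @ W @ zbar
    return np.linalg.solve(lhs, rhs)
```

The left-hand matrix accumulated here is the inverse of the covariance matrix of the estimator,
so it is nonsingular as soon as every characteristic is observed in at least one subset.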


If a normal distribution is assumed, then the least squares estimator is also the maximum
likelihood estimator. Little (1982) has suggested the use of the EM algorithm for this problem
and claimed that the normal distribution assumption is not necessary. That is, estimators
of $\mu$ and $\Sigma$ can be defined as the solution of the normal likelihood equations even if the
underlying population is not normal. These estimators cannot then be considered maximum likelihood
estimators, but only heuristic estimators that are consistent under certain general conditions.
However, if a normal distribution is not assumed, then there is no justification in maximizing
the normal likelihood equations to obtain estimators. An alternative heuristic estimator
for $\Sigma$ is given at the end of this section. The maximum likelihood estimator for $\Sigma$, assuming
normality, is given from Srivastava (1985) as the solution of the following equation:


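$$H = \sum_{k=1}^{K} n_k B_k' (B_k \Sigma B_k')^{-1} B_k - \sum_{k=1}^{K} B_k' (B_k \Sigma B_k')^{-1} V_k (B_k \Sigma B_k')^{-1} B_k = 0, \qquad (3)$$


where

$$V_k = (z_{k1} - B_k \hat\mu, \ldots, z_{k n_k} - B_k \hat\mu)(z_{k1} - B_k \hat\mu, \ldots, z_{k n_k} - B_k \hat\mu)'.$$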


Methods for computing the solutions of (2) and (3) are given in Section 3. 


Note: Alternate estimators for the covariance matrix can be defined heuristically without
the normality assumption. For example $\hat\Sigma$ can be defined as the value of $\Sigma$ that
minimizes


K 
Dy ig BEB) Vie aniplg| (4) 
k=1 


However, the covariance matrix must be positive definite; therefore any expression that
is minimized must yield a positive definite solution. If one of the groups contains complete
data, then (4) will be infinite for any singular matrix $\Sigma$; hence, there will exist a minimum
for (4) in the space of positive definite matrices. A similar argument holds for the maximum
likelihood estimators.


2.3 Asymptotic Distribution of $\hat\mu$.

From (2) it follows that $\hat\mu$ is asymptotically normally distributed with mean $\mu$ and covariance
matrix


$$P = \Big[\sum_{k=1}^{K} n_k B_k' (B_k \Sigma B_k')^{-1} B_k\Big]^{-1}, \qquad (5)$$


which can be estimated by $\hat P$, obtained from $P$ by substituting $\hat\Sigma$ for $\Sigma$. Using this asymptotic
theory, tests of significance and confidence regions (intervals) for $\mu$ or linear combinations
of $\mu$ can be obtained. Alternatively, the likelihood ratio tests given by Srivastava (1985)
may be used for testing the hypothesis $H: \mu = 0$ against the alternative $A: \mu \neq 0$. The
likelihood ratio test rejects the null hypothesis $H$ if


$$-2 \ln \lambda = \sum_{k=1}^{K} n_k \ln \frac{|B_k \tilde\Sigma B_k'|}{|B_k \hat\Sigma B_k'|} \geq \chi^2_{p,\alpha},$$


where $\tilde\Sigma$ is the MLE of $\Sigma$ under $H$ and $\chi^2_{p,\alpha}$ is the upper $100\alpha\%$ point of a chi-square
distribution with $p$ degrees of freedom.


2.4 Maximum Likelihood Estimates for Example 1 


The maximum likelihood estimates for example 1 were obtained as: 


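$$\hat\mu = \begin{pmatrix} 226.82 \\ 246.78 \\ 252.02 \\ ? \\ 255.22 \end{pmatrix}
\quad \text{and} \quad
\hat\Sigma = \begin{pmatrix}
1809 & 1220 & 1033 & 873 & 913 \\
1220 & 1642 & 992 & 10?? & 1214 \\
1033 & 992 & 1438 & 718 & 1189 \\
873 & 10?? & 718 & 125? & 915 \\
913 & 1214 & 1189 & 915 & 2508
\end{pmatrix}$$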


The estimated covariance matrix for the estimate of the mean vector is 


$$\hat P = \begin{pmatrix}
28.05 & 18.78 & 15.96 & 13.46 & 14.08 \\
18.78 & 25.67 & 15.42 & 15.84 & 17.51 \\
15.96 & 15.42 & 24.19 & 11.24 & 19.31 \\
13.46 & 15.84 & 11.24 & 23.33 & 15.38 \\
14.08 & 17.51 & 19.31 & 15.38 & ?
\end{pmatrix}$$


Inference on $\mu$ can be made from the asymptotic distribution of the estimators given in
Section 2.3.


2.5 Imputation 


The imputation of the missing data can be made from the conditional distribution of the
unobserved data given the observed data. That is, define the matrices $C_k$ for $k = 1, \ldots, K$
to be the complements of $B_k$. That is, for a $p_k \times p$ matrix $B_k$ with ones as the $(s, i_s)$ entries
for $s = 1, \ldots, p_k$ and 0's elsewhere, the matrix $C_k$ is defined as the $(p - p_k) \times p$
matrix with ones in the $(t, i_t)$ position and 0's elsewhere, for $i_t \neq i_s$ for all
$t = 1, \ldots, (p - p_k)$ and $s = 1, \ldots, p_k$. If the response vector $y_{kj}$ corresponds to the $j$-th
observation from subset $k$, then the actual observed response vector is $z_{kj} = B_k y_{kj}$ and the
unobserved vector is $u_{kj} = C_k y_{kj}$. The estimated value for the missing vector is given by


$$\hat u_{kj} = C_k \hat\mu + [C_k \hat\Sigma B_k'][B_k \hat\Sigma B_k']^{-1} (z_{kj} - B_k \hat\mu). \qquad (6)$$


Note that the estimated values for the missing vector have no random error. If the data
are to be used in a subsequent analysis, with these imputed values, as if they were a complete
data set, then the estimated error covariance matrix will be too small. The problem of
underestimating the covariance matrix can be overcome by adding an appropriate residual
$e$ to the estimated value $\hat u_{kj}$. If the first subset of complete data is sufficiently large then
the residual vectors for missing observations in subset $k$ can be randomly drawn from the
set of values


$$(C_k y_{1i} - C_k \hat\mu) - (C_k \hat\Sigma B_k')[B_k \hat\Sigma B_k']^{-1} (B_k y_{1i} - B_k \hat\mu) \quad \text{for } i = 1, \ldots, n_1. \qquad (7)$$
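
Expressions (6) and (7) translate into a few lines of matrix code. The following is a sketch
under stated assumptions, not the authors' program: it assumes NumPy, takes the estimates
`mu` and `sigma` as given, stores the complete records of subset 1 as the rows of `complete`,
and uses the hypothetical names `conditional_mean`, `residual_pool` and `impute`.

```python
import numpy as np


def conditional_mean(z, B, C, mu, sigma):
    """Equation (6): predicted missing part given the observed part z."""
    W = np.linalg.inv(B @ sigma @ B.T)
    return C @ mu + C @ sigma @ B.T @ W @ (z - B @ mu)


def residual_pool(complete, B, C, mu, sigma):
    """Expression (7): one residual per complete record (rows of `complete`)."""
    W = np.linalg.inv(B @ sigma @ B.T)
    gain = C @ sigma @ B.T @ W            # regression of missing on observed
    centred = complete - mu
    return centred @ C.T - centred @ B.T @ gain.T


def impute(z, B, C, mu, sigma, complete, rng):
    """Conditional mean plus a residual drawn at random from the pool."""
    pool = residual_pool(complete, B, C, mu, sigma)
    return conditional_mean(z, B, C, mu, sigma) + pool[rng.integers(len(pool))]
```

A generator such as `np.random.default_rng(seed)` can be passed as `rng` so that the random
draws are reproducible.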


Example 1 (continued):


The complete data set, including the imputed values based on (6) and (7), is given in Table
1 for subsets 2-8 with the imputed values in parentheses.


3. COMPUTATIONAL PROCEDURES 


Equations (2) and (3) can be solved iteratively. A procedure using a combined Newton- 
Raphson and steepest ascent method is given in Carter (1986) for a general case that includes 
linearly restricted means and covariances. The procedure is a generalization of the one given 
by Hartley and Hocking (1971). The method can be described as follows. For an initial choice
of $\Sigma$, say $\Sigma_0$, suppose


$$\hat\Sigma = \Sigma_0 + \Delta$$


is a solution. This expression is substituted into (3) and the equation is then expanded in
a series involving only the linear terms of $\Delta$. The following approximate solution for $\Delta$ results.
Define


$$Q = \sum_{k=1}^{K} (n_k D_k \otimes D_k - D_k \otimes F_k - F_k \otimes D_k),$$


where $A \otimes B$ denotes the Kronecker product of two matrices $A$ and $B$ defined by

$$A \otimes B = (a_{ij} B),$$

$$D_k = B_k' (B_k \Sigma_0 B_k')^{-1} B_k,$$

and

$$F_k = B_k' (B_k \Sigma_0 B_k')^{-1} V_k (B_k \Sigma_0 B_k')^{-1} B_k.$$

For any matrix $A = (a_1, \ldots, a_q)'$, we define $\mathrm{vec}(A) = (a_1', \ldots, a_q')'$. Then (3) can
be written approximately as


$$Q\, \mathrm{vec}(\Delta) = \mathrm{vec}(E),$$


where

$$E = \sum_{k=1}^{K} (n_k D_k - F_k).$$


To ensure the nonsingularity of $Q$, we shall write the solution for $\mathrm{vec}(\Delta)$ as

$$\mathrm{vec}(\Delta) = (Q + \lambda I)^{-1} \mathrm{vec}(E), \qquad (8)$$


where $\lambda$ is allowed to vary with the algorithm but is initially set to a very small number.
For a given value of $\Sigma$, $\hat\mu$ is obtained from (2) and then a value of $\Delta$ is obtained from (8)
to produce an updated estimate for $\Sigma$. The procedure is then iterated until a desired level
of convergence is reached.

The above method can be extended to more complex structured covariance matrices;
however, the procedure does require the inversion of $Q + \lambda I$. For a large number of variables
this matrix will be extremely large. In this instance the alternate method of solving (3) using
the EM algorithm is preferable. Again the procedure is iterative, so calculations must be
performed using the updated estimates of $\mu$ and $\Sigma$ at each iteration. For an initial choice
of $\Sigma$, say $\Sigma_0$, define the complete predicted vector $\hat y_{kj} = B_k' z_{kj} + C_k' \hat u_{kj}$, where the
predicted missing value $\hat u_{kj}$ is given in (6). Then


$$\hat\mu = (1/N) \sum_{k=1}^{K} \sum_{j=1}^{n_k} \hat y_{kj}.$$

Define the matrix $V$ by

$$V = \sum_{k=1}^{K} \sum_{j=1}^{n_k} (\hat y_{kj} - \hat\mu)(\hat y_{kj} - \hat\mu)'.$$

The updated estimate of $\Sigma$ is then given by

$$\hat\Sigma = (1/N)\Big(V + \sum_{k=1}^{K} n_k C_k' H_k C_k\Big),$$


where $H_k$ is the conditional variance of the incomplete data given the observed data for the
$k$-th class defined by


$$H_k = C_k \Sigma_0 C_k' - (C_k \Sigma_0 B_k')(B_k \Sigma_0 B_k')^{-1} (B_k \Sigma_0 C_k').$$


The procedure is then iterated. The EM algorithm is advantageous for those situations where
there exist simple closed form solutions for the likelihood equations in the complete data
situation. If a Newton-Raphson procedure is necessary to solve the complete data likelihood
equations then little is gained from the EM algorithm.
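
One pass of the EM update described above can be sketched as follows. This is an illustration
under stated assumptions rather than the authors' implementation: it assumes NumPy, stores
the observed vectors of subset $k$ as the rows of an array `Z`, represents the complement of a
complete subset by a matrix with zero rows, and uses the hypothetical name `em_step`.

```python
import numpy as np


def em_step(groups, mu, sigma):
    """One EM update of (mu, sigma).

    groups: iterable of (B_k, C_k, Z_k); Z_k has shape (n_k, p_k) and C_k has
            shape (p - p_k, p), with zero rows when the subset is complete.
    """
    p = len(mu)
    filled = []
    correction = np.zeros((p, p))
    for B, C, Z in groups:
        W = np.linalg.inv(B @ sigma @ B.T)
        gain = C @ sigma @ B.T @ W                 # regression of missing on observed
        U = (C @ mu) + (Z - B @ mu) @ gain.T       # predicted missing parts, eq. (6)
        filled.append(Z @ B + U @ C)               # rows are B_k' z + C_k' u
        H = C @ sigma @ C.T - gain @ (B @ sigma @ C.T)   # conditional variance H_k
        correction += Z.shape[0] * C.T @ H @ C
    Y = np.vstack(filled)
    new_mu = Y.mean(axis=0)
    centred = Y - new_mu
    new_sigma = (centred.T @ centred + correction) / Y.shape[0]
    return new_mu, new_sigma
```

Calling `em_step` repeatedly, feeding each output back in as the next input, gives the
iteration; a convergence check on successive estimates would be added in practice.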


4. A REGRESSION MODEL


4.1 Incomplete Response Variables.

The model discussed in Section 2 can be extended to handle the regression situation. The
data is again partitioned into $K$ subsets. Then the following regression model is formed:


$$Z_k = B_k \beta A_k + \varepsilon_k, \quad \text{for } k = 1, \ldots, K, \qquad (9)$$


where $Z_k$ is a $p_k \times n_k$ matrix of observed values, $\beta$ is a $p \times q$ matrix of unknown
parameters, $B_k$ is as defined in Section 2, $A_k$ is the design matrix for the matrix $Z_k$ and the
columns of $\varepsilon_k$ are independently distributed with mean 0 and covariance matrix $B_k \Sigma B_k'$. For
a given $\Sigma$, the least squares estimator of $\beta$ can be written from Carter (1986) explicitly as


$$\mathrm{vec}(\hat\beta) = P^{-1} \mathrm{vec}(E),$$


where

$$P = \sum_{k=1}^{K} B_k' (B_k \Sigma B_k')^{-1} B_k \otimes A_k A_k', \qquad (10)$$

$$E = \sum_{k=1}^{K} B_k' (B_k \Sigma B_k')^{-1} Z_k A_k'. \qquad (11)$$


The maximum likelihood estimator of $\Sigma$ is given by the same formula as (3), except that now


$$V_k = [Z_k - B_k \hat\beta A_k][Z_k - B_k \hat\beta A_k]'. \qquad (12)$$


The asymptotic distribution of $\hat\beta$ can be written in the form


$$\mathrm{vec}(\hat\beta) \sim N_{pq}(\mathrm{vec}(\beta), P^{-1}). \qquad (13)$$


Inference on the regression parameters can be made from this asymptotic distribution or 
from the likelihood ratio statistic given in Srivastava (1985). 


4.2 Incomplete Explanatory Variables 


In Section 4.1, the design matrices were assumed to be known completely. In some instances
the explanatory variables can also be incomplete. If the explanatory variables are
random, then these missing values can first be imputed for the explanatory variables given
the observed data, using the procedure of Section 2. Once imputed values for the explanatory
variables are obtained then the method of Section 4.1 can be applied to estimate the regression
parameters and to impute the missing response variables.


4.3. The Likelihood Ratio Test. 


The likelihood ratio procedure can be used to determine if the variables in the model are 
significant. To test the hypothesis 


$$H: \beta = \beta_1 F \quad \text{vs} \quad A: \beta \neq \beta_1 F,$$

for $F$ an $m \times q$ matrix of full rank, the estimates of $\Sigma$ are obtained under the null hypothesis
($\tilde\Sigma$) and under the alternate hypothesis ($\hat\Sigma$). The null hypothesis is rejected at the $\alpha$ level
of significance if


$$-2 \ln \lambda > \chi^2_{(q-m)p,\, \alpha},$$


where

$$\lambda = \prod_{k=1}^{K} |B_k \hat\Sigma B_k'|^{n_k/2} \,/\, |B_k \tilde\Sigma B_k'|^{n_k/2}. \qquad (14)$$


5. ESTIMATING A CHANGE POINT 


Consider a sequence of observations $y_j$, $j = 1, \ldots, N$, with expected values $E(y_j) = \mu_j$.
Srivastava and Worsley (1986) have given a procedure for estimating the point of change
of the mean vectors $\mu_j$. It is first assumed that the change occurs at some point $r$. Then the
following hypothesis is tested.


$$H: \mu_1 = \ldots = \mu_N$$


$$A: \mu_1 = \ldots = \mu_r \neq \mu_{r+1} = \ldots = \mu_N$$

The likelihood ratio statistic is then calculated as $\lambda_r$, for $r = 1, \ldots, N - 1$. The estimated
point of change is that value of $r$ that yields the maximum value of $\lambda_r$.

The existence of incomplete data poses no problems for estimating the change point. The
linear model is set up as for the complete data case, then the observations are grouped into
the $K$ subsets. Suppose that the observed portion of $y_j$ is $z_{ki}$. Then under the alternate
hypothesis for a given $r$, $\hat\Sigma$, the estimate for $\Sigma$, is given from (3) for the regression model defined
in (9)-(12), where the parameter matrix $\beta$ is defined as


$$\beta = (\mu_1, \mu_2)$$


and the design matrix for the $k$-th subset is defined by


$$A_k = \begin{pmatrix} 1 & \ldots & 1 & 0 & \ldots & 0 \\ 0 & \ldots & 0 & 1 & \ldots & 1 \end{pmatrix},$$


where the $i$-th column of $A_k$ has a one in the first row if observation $z_{ki}$ corresponds to the
vector $y_j$ and $j \leq r$ and zero otherwise. Under the null hypothesis the population mean vector
is considered the same for all $N$ observations; hence, $\tilde\Sigma$, the estimate for $\Sigma$, is given from
(2) and (3) for the one population mean problem. The likelihood ratio statistic is obtained
from (14).

Modifications of this procedure are possible. For example the vectors $y_j$ for $j = 1, \ldots, N$
could be sample means for $N$ sampling time points. Multiple change points can be estimated
by repeating the procedure on each section of the data. For 50 observations, if the change
point occurs at point 20 then the procedure is repeated for points 1-20 and 21-50.


6. STRUCTURED COVARIANCE MATRICES


For longitudinal studies the error vectors over time may not be arbitrary, but may follow
a time series model. If such a model can be assumed, then the number of parameters to be
estimated is reduced. A stationary time series would assume that the covariance matrix $\Sigma$
can be written as


$$\Sigma = \sigma^2 \begin{pmatrix}
1 & \rho_1 & \ldots & \rho_{p-1} \\
\rho_1 & 1 & \ldots & \rho_{p-2} \\
\vdots & \vdots & & \vdots \\
\rho_{p-1} & \rho_{p-2} & \ldots & 1
\end{pmatrix}. \qquad (15)$$


Further models can be obtained. The correlations $\rho_i$ can be structured. For example $\rho_i$ can
be set equal to $\rho^{|i|}$. The likelihood equations can be solved using the Newton-Raphson
technique. Carter (1986) considered the case where the covariance matrix can be written as
$\mathrm{vec}(\Sigma) = G\gamma$ for some matrix $G$. By defining $\gamma_i = \sigma^2 \rho_i$ for $i = 1, \ldots, p - 1$ and $\gamma_p = \sigma^2$,
the covariance matrix for the stationary time series can be expressed in this linearly
restricted form. For example for $p = 3$ we have


$$\begin{pmatrix} \sigma_{11} \\ \sigma_{12} \\ \sigma_{13} \\ \sigma_{21} \\ \sigma_{22} \\ \sigma_{23} \\ \sigma_{31} \\ \sigma_{32} \\ \sigma_{33} \end{pmatrix}
= \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{pmatrix}$$


The estimate of $\Sigma$ can be solved numerically from the likelihood equation $G'\,\mathrm{vec}(H) = 0$, where
$H$ is defined in (3). Numerically the Newton-Raphson algorithm from Section 3 can be
employed with the modification that the estimate for $\gamma$ at each iteration is given by


$$\hat\gamma = (G'QG + \lambda I)^{-1} G'\, \mathrm{vec}(E).$$
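
The matrix $G$ for the stationary structure can be generated for any $p$. The sketch below is
an illustration only: the function name `stationary_design` is hypothetical, and it assumes the
entries of $\mathrm{vec}(\Sigma)$ are ordered row by row, as in the $p = 3$ example above.

```python
def stationary_design(p):
    """Return G (p*p rows, p columns) such that vec(Sigma) = G gamma.

    Column lag-1 (0-based) carries gamma_lag = sigma^2 rho_lag for lag = 1..p-1,
    and the last column carries gamma_p = sigma^2.
    """
    G = []
    for i in range(p):
        for j in range(p):
            lag = abs(i - j)
            col = p - 1 if lag == 0 else lag - 1
            G.append([1 if c == col else 0 for c in range(p)])
    return G


def apply_design(G, gamma):
    """Compute G gamma, i.e. the stacked entries of Sigma."""
    return [sum(g * v for g, v in zip(row, gamma)) for row in G]
```

For example, `apply_design(stationary_design(3), gamma)` returns the nine entries of the
$3 \times 3$ covariance matrix in row order for a given vector `gamma`.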


Statistical Editing and Imputation for
Periodic Business Surveys


ABSTRACT


For periodic business surveys which are conducted on a monthly, quarterly or annual basis, the data 
for responding units must be edited and the data for non-responding units must be imputed. This paper 
reports on methods which can be used for editing and imputing data. The editing is comprised of con- 
sistency and statistical edits. The imputation is done for both total non-response and partial non-response. 


KEY WORDS: Periodic survey; Statistical editing; Total/partial non-response; Imputation. 


1. INTRODUCTION 


Data are routinely collected by large organizations such as Statistics Canada based on pro- 
perly designed sample surveys. If such data are collected on a periodic basis from the same 
sampling unit, there are several possibilities which will occur with respect to the data con- 
sistency (quality) over a given time period. The sampling unit may report the data faithfully 
with no dramatic departure in continuity (“smoothness”) as time progresses. The data may 
be reported faithfully, with questionable jumps between two time periods. The sampling unit 
may not report all the requested data items: this is known as partial non-response. The sampling 
unit may report data sporadically with breaks of total non-response for some periods. These 
can occur simultaneously in a periodic survey which collects required data from a large number 
of sampling units. 

The problems which will be addressed in this article are the editing and imputation of data 
for sampling units that are contacted on a periodic basis by a surveying organization. The 
methods discussed are general for data of a multivariate nature composed of both quantitative 
and qualitative variables. The editing will include consistency and statistical edits. 

For quantitative data, consistency edits ensure that linear combination of the data fields 
within a given time period satisfy given requirements. For qualitative data, consistency edits 
ensure that variables correspond to well defined values. 

Statistical edits are used to isolate sampling units which may report some of their quan- 
titative data fields in an inconsistent manner either from time period to time period or within 
a specific time period. Units with unusually high or low values will be termed “outliers”. The 
identification of “outliers” is extremely important in an ongoing survey for two reasons. First, 
they influence statistics of the data set which may be for instance totals. This point has been 
studied by Hidiroglou and Srinath (1981). Second, the imputation of quantitative data for 
non-response units for periodic business surveys is usually based on trends or means: the 
removal of outlier units from the computation of these trends or means, will produce statistics 
that are not contaminated with these observations. For units which have partial non-response,
data must be imputed for the missing fields. 

For large data sets, where timely release of the summary information is crucial, the editing 
and the imputation of data should be automatic and computer handled given some well specified 
rules. This is in agreement with Gentleman and Wilk (1975), and Fellegi and Holt (1976). 


1 M.A. Hidiroglou and J.-M. Berthelot, Business Survey Methods Division, 11th Floor, R.H. Coats Building,
Tunney’s Pasture, Ottawa, Ontario K1A 0T6.


2. EDITING PERIODIC DATA


2.0 Consistency Edits 


For a given unit $i$ and time period $t$, let $x_i(t)$ represent the vector of data which is to be
collected. The vector $x_i(t)$ may be decomposed into a series of elementary vectors for which
independent editing and imputation are required.


That is, $x_i(t) = (x_i^{(1)}(t), x_i^{(2)}(t), \ldots)$
where $x_i^{(p)}(t) = (x_{i1}^{(p)}(t), \ldots, x_{ik_p}^{(p)}(t))$
for $p = 1, 2, \ldots$


and $k_p$ is the number of variables in the $p$:th elementary vector.
For each elementary vector $x_i^{(p)}(t)$, the consistency edits may be represented as


$$A^{(p)} (x_i^{(p)}(t))' \leq (c^{(p)})'$$


where $A^{(p)}$ is an $\ell_p$ by $k_p$ matrix representing the rules that the elements of the elementary
vector $x_i^{(p)}(t)$ must obey, and $c^{(p)}$ is a 1 by $\ell_p$ vector which represents the constraints. This
formulation allows one to define consistency edits for both qualitative and quantitative 
variables. For qualitative variables, the consistency edits could be used to check if the variables 
correspond to well-defined values. For quantitative variables, the consistency edits can check 
if certain variables are not larger (or smaller) than other variables or that a linear combina- 
tion is equal to (or greater than or less than) a given variable. 
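
A consistency edit of this kind amounts to evaluating linear combinations of the fields and
comparing them with bounds. The sketch below is illustrative only: the function name
`failed_edits` and the representation of each edit as a (coefficients, relation, bound) triple
are assumptions made for the example, chosen so that equalities and both directions of
inequality can be written directly.

```python
import operator

RELATIONS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}


def failed_edits(x, edits):
    """Return the indices of the edits that the elementary vector x violates.

    Each edit is a triple (coefficients, relation, bound): the linear
    combination sum(a * v) is compared with the bound using the relation.
    """
    failures = []
    for idx, (coeffs, relation, bound) in enumerate(edits):
        value = sum(a * v for a, v in zip(coeffs, x))
        if not RELATIONS[relation](value, bound):
            failures.append(idx)
    return failures


# Hypothetical edits for a vector (total, part1, part2):
# the parts must add up to the total, and part1 may not exceed the total.
EXAMPLE_EDITS = [([1, -1, -1], "==", 0), ([-1, 1, 0], "<=", 0)]
```

A record passes the consistency edits when `failed_edits` returns an empty list; with
non-integer data an equality edit would normally be checked within a tolerance.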


2.1 Statistical Edits 


Given that data are reported periodically, the problem is to isolate outlying observations 
within the time series. In the present context, an outlying observation will be defined as
one whose trend for the current period to a previous period, for given variables of the element
vector $x_i(t)$, differs significantly from the corresponding overall trend of other observations
belonging to the same subset of the population. Statistical edits can also be applied
within a time period, by comparing the ratios of two correlated variables amongst themselves, 
within a given subset of the population. In this article, the statistical edit will only be discussed 
in terms of the trend between time periods. Similar, somewhat imprecise but working defini- 
tions of outliers have also been given by other authors, for example: 


GRUBBS (1969) says that ‘‘An outlying observation, or outlier, is one that appears to deviate 
markedly from the other members of the sample in which it occurs.’’ 


GUMBEL (1960) says: ‘‘The outliers are values which seem either too large or too small 
as compared to the rest of the observations.’’ 


KENDALL and BUCKLAND (1957, p. 209), write: ‘‘In a sample of n observations it is 
possible for a limited number to be so far separated in value from the remainder that 
they give rise to the question whether they are from a different population, or that the 
sampling technique is at fault. Such values are called outliers. Tests are available to ascer- 
tain whether they can be accepted as homogeneous with the rest of the sample.”’ 


2.1.1 Review of Some Methods Currently Used


Methods for detecting outliers have been proposed by Dixon (1953), Grubbs (1969), Tietjen
and Moore (1972), and Prescott (1978) to mention a few. Most of the test procedures for 
outlier detection proposed by these authors consider the problem as one of hypothesis testing. 
In the simplest cases, the null hypothesis is that the sample comes from a normal distribu- 
tion with unspecified mean and variance, while the alternative hypothesis is that one or more 
of the observations come from a different distribution. Percentage points of a test statistic 
may be determined under the null hypothesis and compared with computed values of the 
test statistic in particular applications. Applying these methods to periodic data from large 
surveys presents problems for the following reasons. First, the assumption of normality of 
trends from one period to another may not hold. Second, these traditional methods require 
the existence of tables for determining critical values which define rejection regions. The 
method which we will propose in Section 2.1.2 does not have the above mentioned disadvan- 
tages. It can be easily implemented on the computer, does not require the assumption of 
normality, and does not make use of tables. 

In our specific context, and given elements of the vectors $x_i(t)$ and $x_i(t + 1)$, denote
as $x_i(t)$ and $x_i(t + 1)$ the responses for two consecutive periods for a given unit, where
$i = 1, \ldots, n$. Denote as $r_i$ the ratio of current period data to previous period data. One
method which is known as the range edit, is to simply define fixed upper and lower bounds 
based on experience for comparison purposes. Ratios found outside these bounds are declared 
as outliers. A major drawback with this method is that the definition of outlier is too subjec- 
tive and does not make use of the distribution of the ratios. 

A method that attempts to make use of the distribution of the ratios is the Chebychev inequality
edit. This edit is constructed by computing the lower bound as $\bar r - k s_r$ and the upper
bound as $\bar r + k s_r$, where $\bar r = \sum_{i=1}^{n} r_i / n$ and $s_r^2 = \sum_{i=1}^{n} (r_i - \bar r)^2 / (n - 1)$. This edit has
two main drawbacks. First, the choice of $k$ is subjective and can result in having an edit that
cannot detect any outliers. This last point has been demonstrated by Wilkinson (1982). Second,
“large” outliers may hide “smaller” outliers. This effect is known as the masking effect.
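
The masking effect can be seen on a small made-up set of ratios. The sketch below is only an
illustration: the data, the value k = 2 and the helper names `chebychev_bounds` and `outside`
are assumptions for the example.

```python
from statistics import mean, stdev


def chebychev_bounds(ratios, k):
    """Lower and upper bounds: mean minus / plus k standard deviations."""
    centre = mean(ratios)
    spread = stdev(ratios)            # uses the divisor n - 1
    return centre - k * spread, centre + k * spread


def outside(ratios, bounds):
    """Ratios falling outside the (low, high) bounds."""
    low, high = bounds
    return [r for r in ratios if r < low or r > high]


# Made-up ratios: most are close to 1, with one large and one very large value.
RATIOS = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 3.0, 50.0]
first_pass = outside(RATIOS, chebychev_bounds(RATIOS, 2))
remaining = [r for r in RATIOS if r not in first_pass]
second_pass = outside(remaining, chebychev_bounds(remaining, 2))
```

In this example the very large ratio inflates both the mean and the standard deviation, so the
ratio of 3.0 is flagged only after the largest value has been removed and the bounds recomputed.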

An improvement to this method has been the use of quartiles and interquartile distances
rather than the use of mean and standard error to come up with the upper and lower bounds.
In this case, the edit is constructed by computing the lower bound as $r_M - k D_{rQ1}$ and the upper
bound as $r_M + k D_{rQ3}$, where $r_M$ is the median of the ratios, $D_{rQ1}$ is the distance between
the first quartile and the median, and $D_{rQ3}$ is the distance between the third quartile and the
median. Since the quartiles are not affected by the tails of the distribution, it greatly alleviates
the masking effect problem. However, this method has two drawbacks. First, in some very specific
circumstances, it is possible that the outliers on the left tail of the distribution are undetectable.
Second, this method does not take into account the fact that in most of the periodic business
surveys, the variability of ratios for small businesses is larger than the variability of ratios for
large businesses (Sugavanam 1983). This fact is expressed by the following graph:


[Figure: quartile edit boundary for ratios]

This drawback has the effect of identifying too many small units as outliers and not enough
large units. This effect will be referred to as the “size masking effect”.
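
The quartile edit described above can be sketched in a few lines. This is an illustration under
stated assumptions: the function name `quartile_edit`, the made-up ratios and the value k = 3
are not from the paper, and the quartiles are computed with the default convention of
`statistics.quantiles`, which may differ from the convention used in any particular survey.

```python
from statistics import quantiles


def quartile_edit(ratios, k):
    """Return ((low, high), flagged) for the quartile edit with multiplier k."""
    q1, median, q3 = quantiles(ratios, n=4)
    low = median - k * (median - q1)      # r_M - k * D_rQ1
    high = median + k * (q3 - median)     # r_M + k * D_rQ3
    flagged = [r for r in ratios if r < low or r > high]
    return (low, high), flagged


# Made-up ratios with one large and one very large value.
RATIOS_Q = [0.90, 0.95, 0.97, 1.00, 1.00, 1.02, 1.03, 1.05, 1.10, 3.0, 50.0]
bounds_q, flagged_q = quartile_edit(RATIOS_Q, 3)
```

Because the quartiles are barely moved by the two extreme ratios, both are flagged in a single
pass here. The bounds do not depend on the size of the unit, which is the source of the size
masking effect.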


2.1.2 Proposed Procedure 


For two occasions $t$ and $t + 1$, the overall trend for the data pairs
$(x_i(t), x_i(t + 1))$, $i = 1, \ldots, n$, is given by


$$R = \sum_{i=1}^{n} x_i(t + 1) \Big/ \sum_{i=1}^{n} x_i(t).$$


Now, $R$ may be expressed as


$$R = \sum_{i=1}^{n} I_i r_i$$


where $I_i = x_i(t) \Big/ \sum_{i=1}^{n} x_i(t)$


and $r_i = x_i(t + 1) / x_i(t)$.


$I_i$ is a measure of the relative importance of the $i$:th unit amongst the $n$ units at time $t$. The
individual trends $r_i$ must be transformed in order to ensure that outliers are detected at both
tails of the distribution. This transformation is:


$$s_i = \begin{cases} 1 - r_M / r_i, & \text{if } 0 < r_i < r_M, \\ r_i / r_M - 1, & \text{if } r_i \geq r_M, \end{cases}$$


where $r_M$ is the median of the ratios.


In order to bring in the magnitude of the data, the following transformation is required
(Berthelot 1983):

$$E_i = s_i \{\mathrm{Max}(x_i(t), x_i(t+1))\}^U,$$

where 0 < U < 1. The $E_i$'s will be referred to as effects, and the exponent U in the transformation
provides a control on the importance associated with the magnitude of the data. This
transformation allows us to place more importance on a small change associated with a
"large" unit than on a large change associated with a "small" unit. The values of
the median and quartiles, as used by Sande (1981), will be applied to the transformed $E_i$'s
in order to detect potential outliers. Denoting by $E_{Q1}$, $E_M$ and $E_{Q3}$ the first quartile, the
median and the third quartile respectively, define the following two deviations:


$$d_{Q1} = \mathrm{Max}(E_M - E_{Q1}, |A E_M|),$$

$$d_{Q3} = \mathrm{Max}(E_{Q3} - E_M, |A E_M|).$$

Outliers will be defined as all those units whose associated effect $E_i$ lies outside the interval
$(E_M - C d_{Q1}, E_M + C d_{Q3})$. The purpose of the $A E_M$ term is to avoid difficulties which
arise when $E_M - E_{Q1}$ or $E_{Q3} - E_M$ is very small. That is, when the effects $E_i$ are clustered
around a single value, one or two modest deviations could otherwise be flagged as false outliers.
The parameter C controls the width of the acceptance interval.
The parameter U controls the shape of the curve defining the upper and lower boundaries. The
effect of increasing U is to attach more importance to fluctuations associated with the
larger observations. A value of 0.05 is suggested for A, as it has proved to be adequate in
practice.
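
The steps of the proposed procedure (ratios, symmetric transformation, effects, quartile-based interval) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `flag_outliers`, the default values of U and C, the "inclusive" quartile rule and the decision to treat boundary values as acceptable are all assumptions made for the example, not part of the method as described; only A = 0.05 comes from the text.

```python
import statistics


def flag_outliers(x_prev, x_curr, U=0.5, C=4.0, A=0.05):
    """Return a list of booleans, True where the pair (x(t), x(t+1)) is flagged.

    U, C and A are the tuning parameters named in the text; the defaults for
    U and C are illustrative assumptions. All values must be strictly positive.
    """
    ratios = [curr / prev for prev, curr in zip(x_prev, x_curr)]
    r_med = statistics.median(ratios)

    # Symmetric transformation of the ratios around their median.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]

    # Effects: transformed ratio scaled by the size of the unit.
    effects = [si * max(prev, curr) ** U
               for si, prev, curr in zip(s, x_prev, x_curr)]

    # Quartiles of the effects (the "inclusive" rule is an assumption).
    e_q1, e_med, e_q3 = statistics.quantiles(effects, n=4, method="inclusive")
    d_q1 = max(e_med - e_q1, abs(A * e_med))
    d_q3 = max(e_q3 - e_med, abs(A * e_med))

    lower, upper = e_med - C * d_q1, e_med + C * d_q3
    # Boundary values are treated as acceptable here (an assumption).
    return [not (lower <= e <= upper) for e in effects]


# Hypothetical data: nine units, the sixth one triples between t and t + 1.
x_t = [100, 120, 80, 90, 110, 100, 95, 105, 1000]
x_t1 = [102, 118, 82, 91, 108, 300, 96, 104, 1010]
flags = flag_outliers(x_t, x_t1)
```

With these hypothetical data only the sixth unit is flagged; the large unit (1000 to 1010) is not, since its change is small relative to its size.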


2.1.3. Treatment For Outliers 


Once units have been identified as possible outliers, they are flagged as such and
brought to the attention of the survey takers. A decision must then be taken on how these
abnormal observations are to be treated. Their existence may have arisen as a result of several
factors. These factors include measurement error, incorrect interpretation of the questionnaire
by the responding unit, or intrinsic variability of the population being surveyed.
For units which have measurement error due to incorrect transcription of the data or
incorrect responses, a simple follow-up will clear up the majority of these errors. For
units which display intrinsic variability as a result of rapid growth, the reported values
are correct but dominate the resulting summary tables too heavily. For those units, techniques
which reduce the sampling weight, as suggested by Hidiroglou and Srinath (1981), or
change the values themselves, as suggested by Ernst (1980), must be used in order to
accommodate (minimize) the effect of outlying observations. For units having unrepresentative
data which cannot be verified, the data must be replaced with other data based on imputation
techniques. The different kinds of corrective actions taken on outlying units must
be flagged as well.


3. IMPUTING PERIODIC DATA 


The information gathered by periodic business surveys, such as sales and employment,
is collected via samples using mail questionnaires or telephone interviews. Non-responding
units are followed up as much as possible within allotted budgets in order to improve
the response rates. The follow-up is usually done by mail in the case of small
to medium-sized non-responding companies and by telephone for the larger or dominating
companies. Although following up delinquent companies improves response rates for a
given reference period, there will nevertheless be a group of non-responding companies
which may be classified as either hard-core non-respondents or late respondents. Hard-core
non-respondents are units which require a great deal of persuasion to respond, if they respond
at all. Late respondents are units which respond late with respect to the survey's reference
period, either because they do not mail back their questionnaire on time or because they need
to be prompted by a follow-up questionnaire. The non-responding units must therefore be imputed in
order to make up for their contribution to the particular estimator being used by the
survey. In the case of Monthly Business Surveys, such as the Monthly Retail Trade
Survey, totals (e.g., sales) are being estimated. Imputation procedures can also be used
to generate values for units declared as outliers. These imputed values can be used in
lieu of the outlying observations if no valid explanation can be provided for their
presence.

Units with no response whatsoever will be termed total non-respondents, and those
with some, but not all, required data items will be termed partial non-respondents. Desirable
features of an imputation system should include the following properties (Berthelot and
Hidiroglou 1982):


• it must automatically determine the most reasonable imputation procedure possible under
the existing circumstances,

• the imputation cell, the level at which the computation of trends and means (medians) is
performed, will usually correspond to the finest level of stratification of the sample,

• a minimum number of units must participate in the computation of trends or means (medians);
otherwise, the imputation cells are automatically collapsed (using a pre-determined
pattern) until the minimum requirement has been satisfied,

• it will recognize through the use of status codes that there are units which must not be
imputed. These include seasonal units during the period that they are not operating, units
temporarily out of business, or units which are no longer active,

• births which have no previous business history will have their data imputed using the means
(medians) of similar responding births,

• units will be re-imputed for a number of periods previous to the current period: this is
done in order to improve the strength of the imputations if the previous periods have been
updated with data,

• backward imputations will be applied to units which have been continuously imputed using
a forward imputation procedure as soon as a good response is obtained for a given period,

• imputation status codes will be associated with imputed units in order to provide a history
of the procedure used for imputation,

• the ranking for imputing non-responding units is as follows: trends (monthly, quarterly,
annual), then means (medians), with the most recent trends being given priority. For instance,
in the case of a monthly system, monthly trends are used for units which have data (response
or imputed) in the month prior to the one to be imputed. Annual trends are used mostly
for units which are seasonal and which fail to provide a response as they emerge from
their out-of-season period, and for which a value from the previous year exists for the month
to be imputed. Imputations based on the trends are obtained by multiplying the trends by the unit's
last month or last year value. In the event that trends cannot be applied, the mean (median)
of the cell is used as an imputation.


In order to formalize the preceding paragraphs in a mathematical fashion, let the number
of units which are expected to respond for a given cell and given month be n. Let the number
of non-respondents with total non-response be $n_3$, the number of respondents with total
response be $n_1$ and the number of respondents with partial response be $n_2$. It is assumed
that the sample design is stratified, with the sampling being simple random without replacement.
Let the size of the follow-up sample of the non-respondents be $m_3$ ($2 \le m_3 \le n_3$,
with $m_3$ having been selected from $n_3$ according to a randomized mechanism). Note that
$n_4 = n - \sum_{i=1}^{3} n_i$ units are not expected to provide any response to the survey process for
a number of possible reasons. At time t, they may be out of season, inactive, dead, or
out of scope for the survey. For these units, the system will automatically assign zero values
to all relevant fields in the given period.

The imputation process will then be done in several different ways according to the type 
of non-response. 


3.0 Total Non-Response 


The imputation process for the total non-respondents will be discussed first. Bearing in
mind that either the whole vector $x_i(t)$ or some of its elementary vectors, as given in
Section 2.0, must be totally imputed, denote by $(x_{i1}(t), \ldots, x_{ip}(t))$ one of the elementary
vectors within $x_i(t)$ for which the editing and imputation process is independent of the other
elementary vectors within $x_i(t)$. Assuming that

$$x_{ip}(t) \ge \sum_{j=1}^{p-1} x_{ij}(t)$$

(which implies that the sum of the first p-1 data elements of the elementary vector does not
exceed the p-th data element, the total), $x_{ip}(t)$ will first be imputed as

$$I^{(1)}_{ip}(t) = \sum_{k=1}^{6} \delta_i^{(k)} z_{ip}^{(k)}(t),$$

where $\delta_i^{(k)}$ refers to the procedure used for imputation and $z_{ip}^{(k)}(t)$ is the associated imputed
value. One of the six $\delta_i^{(k)}$ values will be one and the other five must be zero
($\sum_{k=1}^{6} \delta_i^{(k)} = 1$). The imputed $z_{ip}^{(k)}(t)$ values will be as follows:


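$$z_{ip}^{(1)}(t) = \Big[\sum_{r \in s_1} w_r x_{rp}(t) \Big/ \sum_{r \in s_1} w_r x_{rp}(t-1)\Big] x_{ip}(t-1),$$

$$z_{ip}^{(2)}(t) = \Big[\sum_{r \in s_2} w_r x_{rp}(t) \Big/ \sum_{r \in s_2} w_r x_{rp}(t-Q)\Big] x_{ip}(t-Q),$$

$$z_{ip}^{(3)}(t) = \Big[\sum_{r \in s_3} w_r x_{rp}(t) \Big/ \sum_{r \in s_3} w_r x_{rp}(t-1)\Big] x_{ip}(t-1),$$

$$z_{ip}^{(4)}(t) = \Big[\sum_{r \in s_4} w_r x_{rp}(t) \Big/ \sum_{r \in s_4} w_r x_{rp}(t-Q)\Big] x_{ip}(t-Q),$$

$$z_{ip}^{(5)}(t) = \sum_{r \in s_5} w_r x_{rp}(t) \Big/ \sum_{r \in s_5} w_r,$$

$$z_{ip}^{(6)}(t) = \sum_{r \in s_6} w_r x_{rp}(t) \Big/ \sum_{r \in s_6} w_r,$$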


where $w_r$ = inverse selection probability of unit r for the given cell. The subsets $s_i$ (i = 1, ..., 6)
will be determined by selecting the units which have provided a response for the p-th variable
at time t and which have passed the edits. The conditions for each subset are:

$s_1$ = all units which have provided edited responses at both times t and t-1,

$s_2$ = all units which have provided edited responses at both times t and t-Q,

$s_3$ = units in the follow-up subsample which have provided edited responses at both
times t and t-1,

$s_4$ = units in the follow-up subsample which have provided edited responses at both
times t and t-Q,

$s_5$ = all units which have provided edited responses at time t,

$s_6$ = units in the follow-up subsample which have provided edited responses at time t.

The choice of the imputation procedure will be governed by the following considerations.

(i) Procedures 1 (or 2) will be used if there is a response or imputed value at time t-1
(or t-Q) and it is believed that the trend for the non-respondents is the same
as that for the respondents within the given cell.

(ii) Procedures 3 (or 4) will be used if there is a response or imputed value at time t-1
(or t-Q) and it is believed that the trend for the non-respondents differs from
that for the respondents within the given cell.

(iii) Procedure 5 will be used if there is no response at either time t-1 or t-Q and it
is believed that the mean of the non-respondents is equal to the mean of the respondents
within the given cell.

(iv) Finally, procedure 6 will be used if there is no response at either time t-1 or t-Q
and it is believed that the means of the respondents and non-respondents are different.


The choice between the different procedures can be made using decision tables which
identify the prevailing condition and, given that condition, choose the best imputation procedure
according to pre-determined rules. Once $x_{ip}(t)$ has been imputed for an elementary vector,
its remaining components can be imputed using the procedures for partial non-response.
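
The trend and mean procedures for total non-response, and the ranking that gives priority to the most recent trend, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the `(donors, weights)` argument layout are hypothetical, and the sketch only covers the respondent-based procedures 1, 2 and 5; the follow-up-based procedures 3, 4 and 6 use the same formulas with the follow-up subsample as donor set.

```python
def trend_imputation(earlier_value, donors, weights):
    """Weighted donor trend times the unit's own earlier value.

    donors is a list of (value_at_t, value_at_earlier_time) pairs from the
    relevant subset; weights are the inverse selection probabilities.
    """
    numerator = sum(w * now for w, (now, _) in zip(weights, donors))
    denominator = sum(w * before for w, (_, before) in zip(weights, donors))
    return numerator / denominator * earlier_value


def mean_imputation(donor_values, weights):
    """Weighted mean of the donor values at time t."""
    return sum(w * x for w, x in zip(weights, donor_values)) / sum(weights)


def impute_total(last_month, last_year, month_donors, year_donors, current_donors):
    """Return (procedure_number, imputed_value), preferring the most recent trend.

    last_month / last_year are the unit's own earlier values, or None if absent.
    Each *_donors argument is a (donors, weights) tuple (a hypothetical layout).
    """
    if last_month is not None:
        return 1, trend_imputation(last_month, *month_donors)
    if last_year is not None:
        return 2, trend_imputation(last_year, *year_donors)
    return 5, mean_imputation(*current_donors)


# Hypothetical donor sets for one imputation cell.
month_donors = ([(110, 100), (220, 200)], [1.0, 2.0])   # weighted trend 550/500
year_donors = ([(90, 100)], [1.0])                      # trend 0.9
current_donors = ([10, 20], [1, 3])                     # weighted mean 17.5

with_last_month = impute_total(50, None, month_donors, year_donors, current_donors)
with_last_year = impute_total(None, 200, month_donors, year_donors, current_donors)
with_nothing = impute_total(None, None, month_donors, year_donors, current_donors)
```

Returning the procedure number alongside the value mirrors the requirement that an imputation status code be kept with every imputed unit.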


3.1 Partial Non-Response 


For an elementary vector $(x_{i1}(t), x_{i2}(t), \ldots, x_{ip}(t))$ which is part of $x_i(t)$, let $\delta_{ij}$ be the
indicator variable which is equal to 1 if $x_{ij}(t)$ is present at time t and zero otherwise. Some
additional notation is introduced at this point in order to ease the development. To this end,
define

$$s_{i,R}(t-1) = \sum_{j=1}^{p-1} \delta_{ij} x_{ij}(t-1)$$

= the sum of responses at time t-1 for which there is a response at time t,

$$s_{i,NR}(t-1) = \sum_{j=1}^{p-1} (1 - \delta_{ij}) x_{ij}(t-1)$$

= the sum of responses at time t-1 for which there is no response at time t,

$$s_{i,R}(t) = \sum_{j=1}^{p-1} \delta_{ij} x_{ij}(t).$$

The partial imputation will be based on the assumptions that $x_{ip}(t) \ge \sum_{j=1}^{p-1} x_{ij}(t)$ and that
the distribution of the elements within $x_i(t)$ is similar to the distribution of the elements
within $x_i(t-1)$. In what follows, $s_{i,NR}(t)$ denotes the sum to be imputed for the elements that
are missing at time t. Two separate cases will be discussed.

Case 1: Parts of the elementary vector missing and $x_{ip}(t)$ present

Two subcases are possible: $x_{ip}(t) = \sum_{j=1}^{p-1} x_{ij}(t)$ or $x_{ip}(t) > \sum_{j=1}^{p-1} x_{ij}(t)$.

(i) $x_{ip}(t) = \sum_{j=1}^{p-1} x_{ij}(t)$

If all the elements of $x_i(t)$ excluding $x_{ip}(t)$ are missing, that is $\sum_{j=1}^{p-1} \delta_{ij} = 0$, then we must
have $s_{i,NR}(t) = x_{ip}(t)$. If some of the elements of $x_i(t)$ excluding $x_{ip}(t)$ are missing,
that is $\sum_{j=1}^{p-1} \delta_{ij} > 0$, then $s_{i,NR}(t) = x_{ip}(t) - s_{i,R}(t)$.


(ii) $x_{ip}(t) > \sum_{j=1}^{p-1} x_{ij}(t)$

If all the elements of $x_i(t)$ excluding $x_{ip}(t)$ are missing, then $s_{i,NR}(t) = s_{i,NR}(t-1)\,
x_{ip}(t)/x_{ip}(t-1)$. If some of the elements of $x_i(t)$ excluding $x_{ip}(t)$ are missing, the choice
of $s_{i,NR}(t)$ is not so obvious. In any event, one must have $s_{i,R}(t) + s_{i,NR}(t) \le x_{ip}(t)$.
To this end, four separate possible imputations for $s_{i,NR}(t)$ will be given in order of
preference.

(a) $s_{i,NR}(t) = [s_{i,NR}(t-1) + s_{i,R}(t-1)]\, x_{ip}(t)/x_{ip}(t-1) - s_{i,R}(t)$, provided that
$s_{i,NR}(t) \ge 0$. Note that the condition $x_{ip}(t) > \sum_{j=1}^{p-1} x_{ij}(t)$ is met if $s_{i,NR}(t) \ge 0$.

(b) $s_{i,NR}(t) = s_{i,NR}(t-1)\, [s_{i,R}(t)/s_{i,R}(t-1)]$

(c) $s_{i,NR}(t) = s_{i,NR}(t-1)\, [x_{ip}(t)/x_{ip}(t-1)]$

(d) $s_{i,NR}(t) = x_{ip}(t) - s_{i,R}(t)$.

The preferred imputation will be the first one that does not violate the inequality condition.
For all the above cases, the imputed (actual) values will then be

$$I^{(2)}_{ij}(t) = (1 - \delta_{ij})\, [s_{i,NR}(t)/s_{i,NR}(t-1)]\, x_{ij}(t-1) + \delta_{ij} x_{ij}(t), \quad j = 1, \ldots, p-1.$$
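
The ordered choice among imputations (a) to (d), followed by the proportional allocation to the missing elements, can be sketched as below. The function names, the use of `None` for a missing element and the small numerical tolerance are assumptions made for the illustration; a rule that cannot be evaluated (division by zero) is simply skipped, which is also an assumption.

```python
def choose_s_nr(s_nr_prev, s_r_prev, s_r_now, total_now, total_prev, tol=1e-9):
    """Return (label, value) for the first candidate (a)-(d) that is
    non-negative and keeps s_R(t) + s_NR(t) <= x_ip(t)."""
    candidates = [
        ("a", lambda: (s_nr_prev + s_r_prev) * total_now / total_prev - s_r_now),
        ("b", lambda: s_nr_prev * s_r_now / s_r_prev),
        ("c", lambda: s_nr_prev * total_now / total_prev),
        ("d", lambda: total_now - s_r_now),
    ]
    for label, rule in candidates:
        try:
            value = rule()
        except ZeroDivisionError:
            continue  # rule not applicable, e.g. a zero denominator at t-1
        if value >= 0 and s_r_now + value <= total_now + tol:
            return label, value
    raise ValueError("no admissible imputation: reported parts exceed the total")


def impute_elements(prev, now, s_nr_now):
    """Spread s_NR(t) over the missing elements (None) in proportion to
    their values at t-1; reported elements are kept as they are."""
    s_nr_prev = sum(p for p, n in zip(prev, now) if n is None)
    return [p * s_nr_now / s_nr_prev if n is None else n
            for p, n in zip(prev, now)]


# Hypothetical unit: three components at t-1 (total 70); at t only the first
# component (12) and the total (80) are reported.
prev_parts, now_parts = [10, 20, 30], [12, None, None]
label, s_nr = choose_s_nr(s_nr_prev=50, s_r_prev=10, s_r_now=12,
                          total_now=80, total_prev=70)
imputed = impute_elements(prev_parts, now_parts, s_nr)

# A case where only the last-resort rule (d) satisfies the inequality.
fallback = choose_s_nr(s_nr_prev=50, s_r_prev=10, s_r_now=40,
                       total_now=45, total_prev=70)
```

In the first case rule (a) applies and the imputed components keep the proportions observed at t-1; in the second, rules (a) to (c) all violate the conditions and (d) is used.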


Case 2: Parts of the elementary vector missing and $x_{ip}(t)$ missing

As in case 1, two subcases are possible:

(i) $x_{ip}(t) = \sum_{j=1}^{p-1} x_{ij}(t)$

If $\sum_{j=1}^{p-1} \delta_{ij} = 0$, then $s_{i,NR}(t) = I^{(1)}_{ip}(t)$, where $I^{(1)}_{ip}(t)$ has been obtained using the
imputation for total non-response. The imputation $I^{(1)}_{ip}(t)$ is then used. If $\sum_{j=1}^{p-1} \delta_{ij} > 0$, $I^{(1)}_{ip}(t)$
will be used provided that $s_{i,NR}(t) = I^{(1)}_{ip}(t) - s_{i,R}(t) \ge 0$. Otherwise, the following
imputation must be used

$$I^{(2)}_{ij}(t) = (1 - \delta_{ij})\, [s_{i,NR}(t)/s_{i,NR}(t-1)]\, x_{ij}(t-1) + \delta_{ij} x_{ij}(t), \quad j = 1, \ldots, p-1,$$

and $I^{(1)}_{ip}(t)$ is replaced by $I^{(2)}_{ip}(t) = \sum_{j=1}^{p-1} I^{(2)}_{ij}(t)$.

(ii) $x_{ip}(t) > \sum_{j=1}^{p-1} x_{ij}(t)$

For this case, the $x_{ip}(t)$ in case 1(ii) is replaced by $I^{(1)}_{ip}(t)$ and the methods given for that
case are used, provided that the above inequality condition is satisfied. If the condition
cannot be met, $I^{(2)}_{ij}(t)$ must be used and $I^{(1)}_{ip}(t)$ is replaced by $I^{(2)}_{ip}(t)$.

If the assumption that the distributions of the data elements of vectors $x_i(t)$ and $x_i(t-1)$
are similar does not hold, then each individual element must be imputed using the procedures
for total non-response. These imputations must then be adjusted in order
to satisfy the inequality requirement $x_{ip} \ge \sum_{j=1}^{p-1} x_{ij}$. Hence, for example, for case 1(i), we
would have, for $\sum_{j=1}^{p-1} \delta_{ij} = 0$,

$$I^{(3)}_{ij}(t) = \Big[x_{ip}(t) \Big/ \sum_{j=1}^{p-1} I^{(1)}_{ij}(t)\Big]\, I^{(1)}_{ij}(t), \quad j = 1, \ldots, p-1,$$

and, for $\sum_{j=1}^{p-1} \delta_{ij} > 0$,

$$I^{(3)}_{ij}(t) = \Big[\big(x_{ip}(t) - s_{i,R}(t)\big) \Big/ \sum_{j=1}^{p-1} (1 - \delta_{ij}) I^{(1)}_{ij}(t)\Big]\, (1 - \delta_{ij}) I^{(1)}_{ij}(t) + \delta_{ij} x_{ij}(t), \quad j = 1, \ldots, p-1.$$

Similarly, cases 1(ii) and 2 could be developed using the imputed values $I^{(1)}_{ij}(t)$.


4. CONCLUSION 


For periodic business surveys, it is important to have computer systems which can quickly
and accurately monitor the quality of the flow of incoming data. Conversely,
for expected data that are not coming in, the system should impute as well as possible for
the non-response according to well-specified rules.

The editing will cause records in possible error to be flagged. These errors can be classified
as critical or non-critical. All errors should be corrected by either reviewing the questionnaires
or checking their authenticity with the respondent. If this is not possible on account
of time or budgetary constraints, at least the most critical errors must be corrected. Given that the
errors have been taken care of, the next step of the processing is to impute for the non-respondents.
Diagnostic summaries of the actions (edits or imputations) taken by the system
should be printed out in order to inform the survey analyst of the status of the data.


Practical Criteria for Definition of Weighting Classes


VICTOR TREMBLAY¹


ABSTRACT 


When the technique of adjustment using weighting classes is applied to compensate for the effect of
non-response, several questions arise that call for precise and quantified answers: How does the choice
of the variables used for definition of the classes affect total root-mean-square error, in particular
non-response bias and sampling variance? What rule and what procedure should be followed in choosing
the adjustment variables? On the basis of what criterion can the optimal sizes for the weighting classes
be established? Finally, when this procedure is applied to compensate for non-response with respect
to specific elements of a questionnaire, how can strongly correlated ancillary variables be used
effectively when they themselves are affected by non-response? This article is addressed to those
professionals working at a practical level who are seeking guidelines.


KEY WORDS: Adjustment for non-response; Weighting classes; Poststratification; Non-response bias. 


1. INTRODUCTION 


The problem of adjustment for non-response through creation of weighting classes is clearly
related to that of determination of poststratification criteria. Kish (1978) stated that there
was an urgent need for research in this area, noting that, in terms of advantages and
disadvantages, the final effect of this type of weighting is often unknown. At the same time, Platek,
Singh and Tremblay (1978) developed mathematical expressions for the bias and the variance
of the estimators resulting from adjustment using weighting classes. Their model, which was
based on the response-probability concept, was developed further recently by Platek and
Gray (1983). During the same period, Bailar, Bailey and Corby (1978) described the theoretical
and empirical research undertaken at the US Bureau of the Census. They end their presentation
by emphasizing the importance and the necessity of developing solid theoretical
foundations for the methods of adjustment for non-response. More recently, the Panel on
Incomplete Data (1983) provided a particularly concise and complete description of the
practical implications of adjustment through weighting and stressed the conclusions reached by
Oh and Scheuren (1983) following a simulation study. Chapman (1983) analysed a number
of procedures that could be used to identify the most relevant variables for effective
construction of weighting classes.

This article continues along the same lines as these research efforts by attempting to define
some rules for application of this adjustment procedure starting from theoretical
foundations. The single example used for illustration throughout this text is very specific, but the
reader will no doubt be able to identify much more varied and rich application possibilities.


2. ILLUSTRATION OF THE TECHNIQUE

Let us take as our example the measurement of voters' intentions, a very real and frequently
encountered problem. All of the data used in this text come from the fall 1985
OMNIBUS survey of the Survey Research Centre at the University of Montreal. One section
of this survey was aimed at measuring voters' intentions four weeks before the December
1985 Quebec elections. The responses to the question regarding voting intentions given by
the 1,836 individuals surveyed who intended to vote were distributed as in Table 1.


Table 1

Distribution of Voting Intentions
(with Non-Response)

                                 n(a)        %
Parti Québécois (PQ)              505      27.5
Quebec Liberal Party (QLP)        650      35.4
Other parties                      62       3.4
Non-response                      619      33.7

TOTAL                           1,836     100.0

(a) Number of weighted cases.


This table presents a situation where the non-response problem obviously cannot be ignored.
Blindly distributing the non-responses in proportion to the other responses is a risky approach
based on the supposition that those who did not express their voting intentions have the same
profile as those who answered the question spontaneously.

The two consequences of such a high incidence of non-response are well known: potential
bias and an increase in sampling error following effective reduction of sample size. Any
adjustment technique must be aimed at reducing these two effects. When, as in this case, a
high incidence of non-response can be foreseen, it is appropriate to include in the questionnaire
correlated questions that can be used as a basis for possible adjustments. For example,
it may be very useful to ask the persons surveyed whether or not they are satisfied with the
current government, given the close connection between this index and voting intentions,
as shown in the following table.


Table 2

Satisfaction with the Quebec Government and Voting Intentions
with Regard to the Provincial Election

Voting             Satisfied      Dissatisfied
intentions         (n = 555)      (n = 656)
PQ                   70.1%           17.3%
QLP                  26.7%           76.1%
Other                 3.2%            6.6%


As Table 2 shows, 70.1% of those satisfied with the government intended to support the
party in power (the PQ). However, as might be expected, the situation was reversed among
those who were dissatisfied: 76.1% of this number intended to vote for the QLP, which was
the opposition party at the time.


Table 3

Satisfaction with the Government Cross-Classified with Whether or Not an Answer Was
Given to the Question Regarding Voting Intentions (Number of Weighted Cases)*

                                   Satisfied       Dissatisfied     TOTAL
Answer given to question
regarding voting intentions        n_1 = 555       n_2 = 656        n = 1,211
No answer given to question
regarding voting intentions           236             334             570
TOTAL                              n'_1 = 791      n'_2 = 989       n' = 1,780

* This table excludes 56 non-responses to the question on satisfaction.


One of the techniques available for using this ancillary information is the creation of
weighting classes based on satisfaction. Table 3 presents the complementary data required
for making the adjustments.

If those who were satisfied and those who were dissatisfied are regarded as two weighting
classes, statistical adjustment of the data takes the following form:

if $p_{jc}$ = the proportion of respondents in class c who intend to support party j;
$n_c$ = the number of persons in class c who answered the question regarding
voting intentions;
$n = \sum_c n_c$ = the size of subsample $S_1$ of those who answered the questions regarding
voting intentions and satisfaction;
$n'_c$ = the total number of persons in class c;
and $n' = \sum_c n'_c$ = the size of sample S of those who answered the question regarding
satisfaction,

the adjusted estimates of voting intentions are then calculated as follows:

$$p'_j = (1/n') \sum_c n'_c\, p_{jc}.$$

This new estimate corresponds to introducing a corrective weight equal to $n'_c n/(n_c n')$ for
all respondents in class c.
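
As a worked illustration, the adjusted estimates can be computed from the figures in Tables 2 and 3. The sketch below is hedged in two ways: the function names are hypothetical, and the class total for the satisfied group (791) is taken as 555 + 236, which agrees with the other legible totals (989 and 1,780) up to rounding of weighted cases.

```python
def adjusted_proportions(p_jc, n_prime_c):
    """p'_j = (1/n') * sum over classes of n'_c * p_jc."""
    n_prime = sum(n_prime_c.values())
    parties = next(iter(p_jc.values())).keys()
    return {
        j: sum(n_prime_c[c] * p_jc[c][j] for c in n_prime_c) / n_prime
        for j in parties
    }


def corrective_weights(n_c, n_prime_c):
    """Weight n'_c * n / (n_c * n') applied to every respondent of class c."""
    n, n_prime = sum(n_c.values()), sum(n_prime_c.values())
    return {c: n_prime_c[c] * n / (n_c[c] * n_prime) for c in n_c}


# Proportions from Table 2 and counts from Table 3 (791 is an inferred value).
p_jc = {
    "satisfied": {"PQ": 0.701, "QLP": 0.267, "Other": 0.032},
    "dissatisfied": {"PQ": 0.173, "QLP": 0.761, "Other": 0.066},
}
n_c = {"satisfied": 555, "dissatisfied": 656}
n_prime_c = {"satisfied": 791, "dissatisfied": 989}

adjusted = adjusted_proportions(p_jc, n_prime_c)
weights = corrective_weights(n_c, n_prime_c)
```

Under these assumptions the adjusted PQ share is about 0.408 and the QLP share about 0.541; respondents in the dissatisfied class receive a weight slightly above 1 because that class answered the voting question somewhat less often.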

This simple exercise illustrates the functioning of the well-known mechanism of statistical
adjustment through construction of weighting classes based on traditional poststratification
procedures. The questions which must be examined in more depth for such an application
are as follows:


1. What is the impact of this procedure on reduction of non-response bias? 

2. How does this technique affect sampling error? 

3. What are the best ancillary variables (or combinations of variables) for definition of the
classes?

4. Up to what point is it advantageous to refine definition of the weighting classes? 

5. What should be done with ancillary variables that also involve non-response? 


To answer these questions properly, we must continue to develop the theoretical foundations
for application of weighting classes.


3. IMPACT OF ADJUSTMENT PROCEDURE ON NON-RESPONSE BIAS


The most difficult challenge with respect to non-response is that of quantifying the reduction
of non-response bias following application of a given technique. If this challenge could be
met, it would be possible to measure the bias and, consequently, to produce unbiased estimates.

However, we can still endeavour to understand more fully the mechanisms underlying
non-response, in order to design instruments that reduce as much as possible the impact
of non-response on data quality.

One way of studying the problem is to consider it from the angle of response-probability
theory, according to which we stipulate that, for each unit $U_i$ of the population, the
probability of responding to the survey (or to a specific question asked) is $\alpha_i$ if that unit
is selected. Even though this approach calls for the supposition that the $\alpha_i$ are non-zero, the
theory allows us to derive mathematical expressions for the non-response bias under
a given method, as a function of the observations $X_i$ that we want to obtain and of
the response probabilities $\alpha_i$. This was the approach taken by Platek and Gray (1983); for
estimating a subtotal in weighting class c, by adjusting the sampling estimate using the inverse
of the response rate in class c, they established that the residual non-response bias could
be expressed as follows:

$$B(\hat{X}_c) = \bar{\alpha}_c^{-1} \sum_{i=1}^{N_c} (\alpha_i - \bar{\alpha}_c) X_i, \quad \text{where } \bar{\alpha}_c = N_c^{-1} \sum_{i=1}^{N_c} \alpha_i, \qquad (3.1)$$

and where $N_c$ = the size of class c of the population.

Expression (3.1) reminds us that residual non-response bias exists following application
of the correction factor only if, within class c, there is a correlation between the response
probabilities and the characteristic measured.

Moreover, it is interesting to examine expression (3.1) in the special context of classification
data, that is, where $X_i$ = 0 or 1. Using the notation introduced in the preceding
section, it can be shown that the residual bias of $\hat{X}_c$ following application of the correction
factor can be written on the basis of expression (3.1) in the following form:


$$B(\hat{X}_c) = N_c P_c\, \bar{\alpha}_c^{-1} (\bar{\alpha}_{cX} - \bar{\alpha}_c) = N_c P_c (1 - P_c)\, \bar{\alpha}_c^{-1} (\bar{\alpha}_{cX} - \bar{\alpha}_{c\bar{X}}),$$

where $P_c$ = the real proportion of the units in class c that have characteristic X;

$\bar{\alpha}_{cX}$ = the average of the response probabilities among the units in class c that have
characteristic X;

and $\bar{\alpha}_{c\bar{X}}$ = the average of the response probabilities among the units in class c that do
not have characteristic X.

It is useful to reformulate $B(\hat{X}_c)$ as follows:

$$B(\hat{X}_c) = N_c\, \sigma_c^2\, d_c(X, \bar{X}),$$

where $\sigma_c^2 = P_c(1 - P_c)$ is the variance of characteristic X within class c

and $d_c(X, \bar{X}) = \bar{\alpha}_c^{-1} (\bar{\alpha}_{cX} - \bar{\alpha}_{c\bar{X}})$ is a standardized measurement of the distance
between the average response probability for those who have characteristic
X and that for those who do not have it within class c.

Survey Methodology, June 1986 89 


The non-response bias associated with the estimate p' of P can therefore be expressed as:

$$B(p') = B\Big(N^{-1} \sum_c \hat{X}_c\Big) = N^{-1} \sum_c N_c\, \sigma_c^2\, d_c(X, \bar{X}). \qquad (3.2)$$

Expression (3.2) provides a mathematical argument in support of the thesis frequently
put forward that it is advantageous to construct categories that are as homogeneous as possible
with respect to the phenomenon studied by partitioning the sample into segments, some of
which tend to contain units with characteristic X, and some of which do not.
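
The equivalence between the general expression (3.1) and its form for 0/1 data can be checked numerically. The sketch below uses hypothetical response probabilities for a single class; the function names are illustrative only.

```python
def bias_general(alpha, x):
    """Residual bias from (3.1): sum of (alpha_i - mean) * X_i over the mean."""
    a_bar = sum(alpha) / len(alpha)
    return sum((a - a_bar) * xi for a, xi in zip(alpha, x)) / a_bar


def bias_binary(alpha, x):
    """Same bias written as N_c * P_c * (1 - P_c) * d_c for 0/1 data."""
    n_c = len(alpha)
    a_bar = sum(alpha) / n_c
    have = [a for a, xi in zip(alpha, x) if xi == 1]
    lack = [a for a, xi in zip(alpha, x) if xi == 0]
    p_c = len(have) / n_c
    d_c = (sum(have) / len(have) - sum(lack) / len(lack)) / a_bar
    return n_c * p_c * (1 - p_c) * d_c


# Hypothetical class of six units: those with the characteristic respond more often.
alpha = [0.9, 0.8, 0.7, 0.5, 0.4, 0.3]
x = [1, 1, 1, 0, 0, 0]
b1 = bias_general(alpha, x)
b2 = bias_binary(alpha, x)

# When response probabilities are unrelated to X, the residual bias vanishes.
b_none = bias_general([0.6] * 6, x)
```

Both functions give the same value here, and the bias is zero when every unit has the same response probability, which is the situation the weighting classes try to approximate.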


4. IMPACT OF ADJUSTMENT PROCEDURE ON SAMPLING ERROR 


As is well known, one of the consequences of the non-response problem is an increase in
random sampling error following the reduction in the number of observations. It is revealing
to examine to what extent the adjustment technique discussed here compensates for this loss
of precision. A number of the authors referred to in the introduction have pointed out the
potential danger in having corrective weights that are too large or too unstable, being based
on too limited a number of observations within a class. Platek and Gray (1983) presented
an approximate expression for the component of variance attributable to non-response following
adjustment.

Although it is instructive regarding the general behaviour of this component of sampling
variance, this mathematical development does not reveal the critical point beyond which
refinement of the weighting classes adversely affects data accuracy.

In reality, we find ourselves in the following situation. The person conducting the survey
has some reliable information with respect to a representative sample of the population being
studied (for example, information regarding satisfaction with the government), but the
data that interest him or her most for purposes of the survey (for example, information
regarding voting intentions) are available only for a subsample, and he or she would like to use
certain data from the base sample to improve the accuracy of the estimators. Whether we
are talking about non-response at the level of the sampled units or about non-response at
the level of specific questions in a questionnaire, the fundamental problem is the same. From
the point of view of estimator variance, there is some analogy with double sampling, where
data adjustment corresponds to application of the separate-ratio estimators, that is, to
poststratification using categories definable on the basis of information available in the base
sample. Of course, this analogy is unacceptable as far as analysis of the biasing effect of
non-response is concerned, since one cannot support the hypothesis that the subsample of
the respondents is probabilistically representative of the base sample. However, for purposes
of studying estimator variance, the analogical approach is as useful as it is defensible.

More specifically, imagine the following situation. A simple random sample S of size n'
gives us the distribution of a classification variable for the total population, with
$\hat{N}_c = (N/n')\,n'_c$ as the estimator of the number of units of the population belonging to class
c. A simple random sample $S_1 \subset S$ of size n = fn' (0 < f < 1) is chosen to measure the
distribution of another classification variable X. For each of the units of $S_1$, we know the
classification on the basis of the two variables described above.

We want to estimate the proportion $P_j$ of units belonging to class j of variable X. The
simple estimator inferred from $S_1$ is

$$p_j = (1/n) \sum_c n_c\, p_{jc}.$$

Moreover, the separate-ratio (poststratified) estimator can be expressed as follows:

$$p'_j = (1/n') \sum_c n'_c\, p_{jc}.$$

While all the units of sample $S_1$ are given a weight equal to 1 in the expression for $p_j$, we can
see that, in the expression for $p'_j$, the weight of the units varies, depending on the class c to
which they belong. These "corrective" weights, equal to $n'_c n/(n_c n')$, use the complementary
information available with respect to sample S as a whole for division into classes.

According to Tremblay (1975), if the formula for the variance of p; is developed, keep- 
ing the terms to the size of the relative variance of the N., the following is obtained: 


Var p; = Var pl —f\/nL Lye, Pb) rere) el (4.1) 


Cc 


where r, = N(1 —P.)/nN-: the relative variance of the N, estimator that is,V. = (N/n)n¢ 


P;, = N,/N: the proportion of the population belonging to class c; 


Pic= Epje’ the proportion of the units that have characteristic j in class c; 


P; = Ep;: the proportion of the units of the population that have characteristic /. 


Equation (4.1) shows that the effectiveness of the technique of adjustment using weighting 
classes increases as interclass variance increases and, consequently, as intraclass variance 
decreases. It is easy to verify that, in the extreme case where there is maximal interclass variance 
- that is, where all of the $P_{jc}$ are either 0 or 1: 

$$\mathrm{Var}\,\hat{p}_j = f\,\mathrm{Var}\,p_j = P_j(1 - P_j)/n',$$

that is, the variance that would have been obtained if all of the $n'$ units had responded. 


In addition, equation (4.1) reminds us that, in so far as the relative variances are negligible 
with respect to 1, it is advantageous to refine the partitioning, dividing the sample into 
a large number of classes. We thereby increase interclass variation and, by the same token, 
reduce the variance of $\hat{p}_j$. 

However, refinement of the partitioning is limited by the presence of relative variances 
$r_c$. We should look at this situation a little more closely. Let us postulate that a first 
partitioning of the sample into a group of classes $C'$ produces estimator $\hat{p}'_j$ as 
previously defined. Then let us postulate that a second, more refined partitioning $C''$ allows 
for the construction of estimator $\hat{p}''_j$. If all of the classes of $C''$ coincide with the 
classes of $C'$, except for one class $c$ divided into two parts ($c_1$ and $c_2$, that is, 
$c = c_1 \cup c_2$), it would be interesting to find a simple criterion for determining which of 
the two partitions ($C'$ or $C''$) produces the smallest variance, taking into account the $r_c$ 
factors in expression (4.1) above. We know that: 


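$$\mathrm{Var}\,\hat{p}''_j < \mathrm{Var}\,\hat{p}'_j$$

if and only if 

$$\sum_{c \in C''} (P_{jc} - P_j)^2 P_c - \sum_{c \in C'} (P_{jc} - P_j)^2 P_c > \sum_{c \in C''} r_c P_{jc}(1 - P_{jc}) P_c - \sum_{c \in C'} r_c P_{jc}(1 - P_{jc}) P_c.$$

The left-hand member $G$ of the inequality can be expressed thus: 

$$G = \sum_{c \in C''} (P_{jc} - P_j)^2 P_c - \sum_{c \in C'} (P_{jc} - P_j)^2 P_c = P_{jc_1}^2 P_{c_1} + P_{jc_2}^2 P_{c_2} - P_{jc}^2 P_c. \tag{4.2}$$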


If class $c$ has been partitioned in the following way: 

$$n_{c_1} = a\,n_c \quad \text{and} \quad n_{c_2} = (1 - a)\,n_c, \quad \text{where } 0 < a < 1,$$

we know that $P_{jc} = aP_{jc_1} + (1 - a)P_{jc_2}$, that $P_{c_1} = aP_c$ and, finally, that 
$P_{c_2} = (1 - a)P_c$. Expression (4.2) can therefore be written compactly as follows: 

$$G = P_c\,a(1 - a)\,(P_{jc_1} - P_{jc_2})^2.$$


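Moreover, the right-hand member can be reduced to: 

$$D = r_{c_1} P_{jc_1}(1 - P_{jc_1}) P_{c_1} + r_{c_2} P_{jc_2}(1 - P_{jc_2}) P_{c_2} - r_c P_{jc}(1 - P_{jc}) P_c.$$

With respect to the relevance of refining the partitioning, by replacing relative variances 
$r_c$ with the expression previously established, noting that the terms $P_{c_1}$, $P_{c_2}$ and $P_c$ are 
negligible with respect to 1, we obtain: 

$$D = (1/n) \left[ P_{jc_1}(1 - P_{jc_1}) + P_{jc_2}(1 - P_{jc_2}) - P_{jc}(1 - P_{jc}) \right].$$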


Because of the concavity of the function $P(1 - P)$ and the fact that $P_{jc}$ is a linear 
combination of $P_{jc_1}$ and $P_{jc_2}$, the value of $D$ is bounded above by $1/4n$. Thus, for the 
variance of $\hat{p}''_j$ to be smaller than that of $\hat{p}'_j$, it is sufficient that: 

$$P_c\,a(1 - a)\,(P_{jc_1} - P_{jc_2})^2 \geq 1/4n.$$

If $P_c$ is estimated using $n_c/n$ on the basis of subsample $S_1$, this condition takes the 
following form: 

$$\mathrm{DIF} = |P_{jc_1} - P_{jc_2}| \geq \frac{1}{2\sqrt{a(1 - a)\,n_c}} = \mathrm{DIFMIN} \tag{4.3}$$


Inequality (4.3) therefore reveals a simple rule that is sufficient to make it advantageous 
to divide class $c$ into $c_1 \cup c_2$. As we might have expected intuitively, the larger the number 
of respondents in class $c$ (in sample $S_1$) or the greater the difference between the $P_{jc_1}$ and 
$P_{jc_2}$ proportions, the more advantageous it is to refine the partitioning of the classes. Table 4 
presents the minimal differences (DIFMIN) corresponding to various values of $n_c$ and $a$. 


Table 4. 
Values of DIFMIN = $1/(2\sqrt{a(1 - a)\,n_c})$ (in %) 

   n_c     a = 1/2    a = 1/4    a = 1/10 
  1000      3.1%       3.7%       5.3% 
   400      5.0%       5.8%       8.3% 
   200      7.1%       8.2%      11.8% 
   100     10.0%      11.5%      16.7% 
    80     11.2%      12.9%      18.6% 
    60     12.9%      14.9%      21.5% 
    40     15.8%      18.3%      26.4% 
    20     22.4%      25.8%      37.3% 
    15     25.8%      29.8%      43.0% 
    10     31.6%      36.5%      52.7% 



The above table tells us that, for example, if we have a class containing 100 respondents 
which we are considering dividing into two more or less equal parts, there must be a dif- 
ference of at least 10% between the two new classes where the j characteristic is concerned 
if the refinement of the partitioning is to help reduce sampling error. If there is less than 
a 10% difference between the two, refinement will serve no purpose, and may even increase 
the variability of the estimates produced. Moreover, we can see that if subclasses $c_1$ and $c_2$ 
are very unequal, the requirement regarding differentiated behaviour of their respondents 
with respect to characteristic $j$ (that is, $P_{jc_1}$ versus $P_{jc_2}$) is stronger. Thus, if $c_1$ 
represents approximately 10% of $c$, the minimal difference (DIFMIN) is 16.7%. 
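
The rule expressed by inequality (4.3) is simple to evaluate for any class size and split. The sketch below is only an illustration: the function names `difmin` and `worth_splitting` are assumptions introduced here, and the rule itself is the sufficient condition derived above, not an exact variance comparison.

```python
import math


def difmin(n_c, a):
    """Minimal difference 1 / (2 * sqrt(a * (1 - a) * n_c)) from (4.3).

    n_c: number of respondents in the class to be split
    a:   share of the class assigned to the first subclass (0 < a < 1)
    """
    if not 0.0 < a < 1.0:
        raise ValueError("a must lie strictly between 0 and 1")
    return 1.0 / (2.0 * math.sqrt(a * (1.0 - a) * n_c))


def worth_splitting(p_jc1, p_jc2, n_c, a):
    """True when the observed difference meets the sufficient condition (4.3)."""
    return abs(p_jc1 - p_jc2) >= difmin(n_c, a)


# A class of 100 respondents split in half needs a 10-point difference;
# a 10%/90% split of the same class needs about 16.7 points.
half_split = difmin(100, 0.5)
uneven_split = difmin(100, 0.1)
```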

In the specific case where class $c$ is divided more or less equally between $c_1$ and $c_2$, the 
minimal difference (DIFMIN) can be expressed very compactly: 


$$\mathrm{DIFMIN} = 1/\sqrt{n_c}$$


In situations where class $c$ is divided into several components ($c = c_1 \cup c_2 \cup \ldots \cup c_k$), we can 
apply the test described here, considering the smallest of subclasses $c_i$ on the one hand, and 
all of the rest on the other. Since, in this case, $a$ (or $1 - a$) may be small, we can simplify the 
rule expressed by inequality (4.3) and consider the minimal difference as follows: 


$$\mathrm{DIFMIN} = \frac{1}{2\sqrt{\min_i\,(n_{c_i})}}$$

It should be noted here that these results were developed by analogy in the context of 
sampling in two phases, and that the rules which have been arrived at may apply both to 
separate-ratio estimators and to poststratified estimators. For example, it is often useful to 
determine up to what point refinement of a poststratification produces more precise results. 
The rules set out here may therefore serve as a guide. 
5. CRITERION FOR CHOOSING ADJUSTMENT VARIABLES 


Looking once more at the survey of voters’ intentions, we see that the degree of satisfac- 
tion with the government can certainly serve as an adjustment variable for non-response with 
respect to the question regarding voting intentions. However, is this really the best variable 
we could use? If the survey instrument contains other questions connected indirectly with 
voting intentions, on the basis of what criterion can we choose between, for example, satisfac- 
tion, certain sociodemographic profiles (language, education) and the perception as to who 
would make the best premier? 

The two preceding sections show us that the more homogeneous the constructed classes 
are, the more variance of the adjusted estimates is reduced and the more likely it is that the 
bias itself will be smaller. It is therefore advantageous to create classes that maximize 
interclass variance of estimator $p_j$. With respect to algebraic expression (4.1), the 
partitioning chosen must maximize the quantity 


$$\mathrm{INTERCL}_j = \sum_c (P_{jc} - P_j)^2 P_c.$$


For a multinomial variable $X$ with parameters $P_1, P_2, \ldots, P_J$, the problem is finding a 
statistic that incorporates all of the $\mathrm{INTERCL}_j$ quantities ($j = 1, \ldots, J$). In this case, $\chi^2$ 
merits consideration, since 


$$\chi^2 = N \sum_j \sum_c (P_{jc} - P_j)^2 P_c / P_j = N \sum_j (\mathrm{INTERCL}_j / P_j).$$


In other words, $\chi^2$ is equal to a linear combination of the relative values of the $\mathrm{INTERCL}_j$'s, 
weighted as a function of the $P_j$'s. On the other hand, since $B_j = \mathrm{INTERCL}_j / (P_j(1 - P_j))$ 
measures the proportion of the variance explained by division into classes, there is also 
justification for considering the statistic 


$$\sum_j B_j = \sum_j \sum_c (P_{jc} - P_j)^2 P_c / (P_j(1 - P_j)).$$


Note that the latter statistic is equivalent to $\chi^2$ in three specific situations: a) when $X$ is 
dichotomous; b) when the $P_j$'s are almost equal; and c) when the $P_j$'s are all small. In the 
multinomial case, where it is important to refine estimation of a $P_j$ for a particular $j$ index, 
we can therefore dichotomize variable $X$ as a function of this $j$ index and use $\chi^2$ as a 
performance criterion for division into classes. For our example, we will use $\chi^2$, since this 
statistic is produced directly by most of the software used for processing survey data. 
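
The quantities of this section can be computed directly from a cross-tabulation of characteristic by class. The following sketch assumes NumPy and uses sample counts as plug-in values for the population proportions, so the sample total stands in for $N$; the function name and the example counts are hypothetical.

```python
import numpy as np


def interclass_statistics(counts):
    """Return (INTERCL_j, B_j, chi-square) from a J x C table of counts.

    counts[j, c] is the number of respondents with characteristic j in class c.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_c = counts.sum(axis=0) / total        # class shares P_c
    p_j = counts.sum(axis=1) / total        # overall proportions P_j
    p_jc = counts / counts.sum(axis=0)      # within-class proportions P_jc
    # INTERCL_j = sum over classes of (P_jc - P_j)^2 * P_c
    intercl = ((p_jc - p_j[:, None]) ** 2 * p_c).sum(axis=1)
    # B_j: share of the variance P_j(1 - P_j) explained by the classes
    b = intercl / (p_j * (1.0 - p_j))
    chi2 = total * (intercl / p_j).sum()
    return intercl, b, chi2


# Hypothetical 2 x 2 table: rows are characteristics, columns are classes.
intercl, b, chi2 = interclass_statistics([[30, 10], [20, 40]])
```

For this table the value returned as `chi2` coincides with the usual Pearson statistic computed from observed and expected counts, which is the identity stated in the text.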


6. APPLICATION AND INHERENT PROBLEMS 


In the preceding discussion, we found a criterion for evaluating the performance of 
weighting classes. In practice, however, variables which best explain variance may also be 
affected by the non-response problem. This complicates the choosing of weighting classes 
to some extent. 

The following table presents a list of variables deemed interesting a priori by a researcher 
for the purpose of weighting to adjust for non-response with respect to the voting-intentions 
question. For each potentially useful variable, the table gives the value of $\chi^2$, the 
number of missing values and the total number of missing values when the variable is 
crossed with the question on voting intentions. Remember that the latter question, taken alone, 
accounted for 619 non-responses in the survey. 

The value of $\chi^2$ is very revealing with respect to the predictive force of the different 
variables involved. For example, we can see that, among the sociodemographic variables, 
only mother tongue has an impact that merits attention. On the other hand, some thematic 
questions show an unequivocal link with voting intentions - in particular, that regarding 
degree of satisfaction with the present government and that which asks which of the two 
main party leaders would be the best premier. It is clear that the more a question is perceived 
as being connected with the basic question, the more difficult it is to obtain responses. Only 
56 non-responses were recorded for the more insignificant question regarding satisfaction 
with the government (approximately 3% of the sample), but there were 392 non-responses 
when people were asked who would be the best premier! 

In the creation of weighting classes, it is therefore advantageous to try to use variables 
strongly correlated with the phenomenon being studied and, where possible, variables which are both 
strongly correlated and characterized by an excellent response rate. In addition, by crossing 
the relevant variables with each other, we can create classes that are more homogeneous and, 
consequently, increase the value of $\chi^2$. Obviously, the degree of refinement of the classes 
must be in line with the limiting criterion previously expressed by inequality (4.3). 


Table 5 
List of Variables That Might Be Useful for Compensation, through Weighting, for 
the Effect of Non-Response with Respect to the Question on Voting Intentions 

                                               Value     Number of missing   Number of missing data 
Variable*                                      of χ²     data on the         upon cross-classification 
                                                         variable            with voting intentions 

Age (6)                                          34            4                    620 
Education (4)                                     8            3                    621 
Mother tongue (2)                                96            0                    619 
Degree of satisfaction with 
  Quebec government (4)                         382           56                    625 
Degree of satisfaction with 
  Quebec government (2)                         346           56                    625 
Identification of best premier (3)          [illegible]      392                    686 
Vote in 1981 provincial election                109          269                    658 
Interest in politics                              1            1                    619 
Degree of satisfaction with 
  federal government (4)                         39           58                    631 
Voting intentions at federal level (4)          288          694                    832 

* The figures in parentheses indicate the number of classes considered for the variables in question. 
Consider, for example, the formation of weighting classes on the basis of three variables 
that are explanatory with respect to voting intentions - namely, identification of the party 
leader who would be the best premier (3 response categories), degree of satisfaction with 
the government (4 categories) and mother tongue (2 categories). At this stage, the idea is 
to project the voting intentions determined for the respondents in a given class onto all of 
the individuals in that class - that is, those for whom it has been possible to establish a 
classification. The first step in the process is to refine the classes as much as possible on 
the basis of the three variables involved and produce a cross tabulation of voting intentions 
in accordance with these twenty-four (3x4x2) classes. Referring either to the criterion reveal- 
ed in equation (4.3) or to Table 4, we eliminate through combination those classes which 
are too small. Where necessary, we therefore group together ‘‘similar’’ classes - that is, classes 
that have a similar voting-intentions profile. We are then in a position to produce a table 
like that on the following page, in which voting intentions are cross-classified following this 
new division. An examination of the data may also suggest a few groupings. In addition, 
Table 6 presents other relevant data. For example, the last two lines compare by class the 
number of individuals who answered the question regarding voting intentions with the total 
number of individuals surveyed who can be classified in accordance with the three variables 
involved. From this, we obtain a first weighting system. In the example, there are 283 per- 
sons overall who can be classified, but whose voting intentions are not known. In addition, 
the overall value of $\chi^2$ is 891, a distinct improvement over the situation when the variables 
were taken alone (Table 5). 

Finally, in the $B_j$ column, for each $P_j$, there appear estimates of the percentage of the 
variance that can be attributed to interclass variance. These $B_j$'s measure the increase in 
precision (variance reduction) that can be attributed to adjustment of the data in accordance 
with the type of partition chosen. This is clear if we rewrite equation (4.1) as follows 
(disregarding relative variances): 


$$\mathrm{Var}\,\hat{p}_j = \mathrm{Var}\,p_j - (1 - f)\,B_j\,\mathrm{Var}\,p_j$$


Having a $B_j$ equal to 61.9% for estimation of intention to vote for the PQ means that, 
from the point of view of variance reduction, adjustment of the data is equivalent to having 
recovered in the field 61.9% of the 283 non-responses for the question on voting intentions. 
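
As a rough numerical illustration of the rewritten form of equation (4.1), the sketch below evaluates the ratio of the adjusted to the unadjusted variance. It assumes that $f$ can be approximated by the ratio of the 1144 respondents to the 1427 classifiable individuals reported in Table 6, and it disregards the relative variances, as the rewritten formula does; the function name is hypothetical.

```python
def variance_ratio(f, b_j):
    """Ratio Var(adjusted) / Var(simple) = 1 - (1 - f) * B_j."""
    return 1.0 - (1.0 - f) * b_j


n_respondents = 1144      # answered classification and voting-intentions questions
n_classified = 1427       # answered the classification questions
f = n_respondents / n_classified
ratio_pq = variance_ratio(f, 0.619)   # B_j for intention to vote PQ
```

Under these assumptions the adjusted variance is about 88% of the unadjusted one, which is of the same order as the gain expected from recovering a little over 60% of the 283 classifiable non-respondents.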

We now have the residual problem of determining how to adjust for non-response for 
specific questions using variables that have themselves been affected by non-response. 

In the example produced through division in accordance with Table 6, it is clear that a 
significant portion of the non-responses with respect to voting intentions is not corrected 
through this kind of weighting. In effect, we are left with 409 cases of non-response that 
cannot be dealt with in this fashion, since classification with respect to a reference variable 
cannot be determined. One possibility that might be explored here is establishment of a 
weighting system that would allow us to use, for each non-respondent, the maximum number 
of variables available for estimating the missing data. For example, the voting-intentions 
profile of persons who did not respond to the question on voting intentions or to that asking 
who would be the best premier, but who we know are Francophone and are satisfied with 
the government in power, would be inferred on the basis of the voting-intentions profile of 
the Francophone respondents satisfied with the government. A weighting system can easily 
be developed for this process of attribution. 
Table 6. 
Study of a Partitioning of the Sample 

Cells marked [ill.] are illegible in the source. 

Best premier: Johnson (PQ) 

Satisfied with government                 Very     Fairly   Very or      Not very or not at all 
                                                            fairly 
Mother tongue                             Franco.  Franco.  Non-franco.  Franco.   Non-franco. 

% vote PQ                                 100      88.3     42.7         63.4      32.9 
% vote PLQ                                0.0      9.5      45.8         30.2      61.4 
% vote Other                              0.0      [ill.]   [ill.]       6.4       [ill.] 
Number of respondents for the 
  classification and voting- 
  intentions questions                    51       342      [ill.]       133       [ill.] 
Number of respondents for the 
  classification questions                59       404      50           203       28 

Best premier: Bourassa (PLQ) 

Satisfied with government                 Very or fairly        Not very              Not at all 
Mother tongue                             Franco.  Non-franco.  Franco.  Non-franco.  Franco. 

% vote PQ                                 [ill.]   0.0          6.5      [ill.]       [ill.] 
% vote PLQ                                86.5     85.9         92.4     98.8         93.9 
% vote Other                              [ill.]   14.1         1.1      0.0          2.6 
Number of respondents for the 
  classification and voting- 
  intentions questions                    64       21           156      49           59 
Number of respondents for the 
  classification questions                81       24           178      54           175 

Best premier                              Other than    Neither Bourassa nor Johnson      TOTAL    B_j (%) 
                                          Johnson 
Satisfied with government                 Not at all    Very or     Not very or 
                                                        fairly      not at all 
Mother tongue                             Non-franco.   -           - 

% vote PQ                                 0.0           14.3        4.6                   42.7     61.9 
% vote PLQ                                100.0         73.5        54.9                  52.6     58.7 
% vote Other                              0.0           [ill.]      40.5                  4.7      [ill.] 
Number of respondents for the 
  classification and voting- 
  intentions questions                    42            [ill.]      51                    1144     χ² = 891 
Number of respondents for the 
  classification questions                49            32          89                    1427 

A Study of the Effects of Imputation Groups in 
the Nearest Neighbour Imputation 
Method for the National Farm Survey 


SIMON CHEUNG and CRAIG SEKO¹ 


ABSTRACT 


A new processing system using the nearest neighbour (N-N) imputation method is being implemented 
for the National Farm Survey (NFS). An empirical study was conducted to determine if the NFS estimates 
would be affected by using imputation groups based on type of farm. For the specific imputation rule 
examined, the study showed evidence that the effect might be small. 


KEY WORDS: National Farm Survey; Item non-response; Nearest neighbour imputation; Match variable 
transformation. 


1. INTRODUCTION 


The National Farm Survey (NFS) is an annual multi-purpose survey of agricultural activi- 
ty in Canada. The survey uses a 2-frame sample design i.e. a list frame of large farms (based 
on the quinquennial Census of Agriculture) and an area frame of agricultural land. The largest 
units in the list frame are sampled with certainty (i.e. with probability one) because of their 
disproportionate impact on the survey estimates. These units are called specified farms. The 
remaining farms in the list frame are stratified and sampled. The small farms in the survey 
population, which are comparatively very large in number, are covered by the area frame and 
sampled less extensively than the list frame farms. Thus three samples are selected: specified, 
list and area. The detailed NFS sample design has been described by Davidson and Ingram 
(1983), and Davidson (1984). 

The NFS is processed by a system adopted from predecessor surveys. This system employs 
the sequential hot-deck imputation method to adjust for unit and item non-response (Philips 
1979). A new survey processing system will be implemented in 1987 in order to integrate all 
the agricultural surveys conducted by Statistics Canada. This system will use the nearest 
neighbour (N-N) imputation method to adjust for item non-response. The decision to imple- 
ment the N-N imputation method was based on many reasons, among which there are three 
important ones: First, the use of the N-N method is theoretically more justified than the exact- 
matching sequential hot-deck method since the survey collects mostly quantitative data. 
Second, empirical studies, e.g. Kovar (1982), suggest that the two imputation methods would 
yield similar estimates for the NFS with the N-N method resulting in fewer outliers i.e. im- 
puted data which have disproportionate contributions to the survey estimates. Third, switching 
to this new imputation method for the NFS would help standardize the survey methodology 
of all agricultural surveys, a long term goal of Statistics Canada. Currently, the Census of 
Agriculture and the Farm Tax Data Survey both use the N-N imputation methodology. 

This paper reports on an empirical study which attempts to provide information that will 
help in a more efficient implementation of the new imputation method. The next section 
describes briefly the N-N imputation method adopted in our study. Section three presents 
the study procedure and the main results obtained. Finally, we discuss our preliminary obser- 
vations drawn from the results in section four. 


¹ Simon Cheung and Craig Seko, Business Survey Methods Division, Statistics Canada, ith Floor, R.H. Coats Building, 
Tunney’s Pasture, Ottawa, Ontario, Canada K1A 0T6. 
2. NEAREST NEIGHBOUR IMPUTATION METHOD 


The method of donor imputation, in general, is to replace the missing or invalid values 
of a respondent (recipient) with the valid response of another respondent (donor) who is 
deemed to have the same characteristics as the recipient. The sequential hot-deck imputation 
method identifies donors sequentially in the course of processing as those reporting the same 
values as the recipient in the pre-specified match variables. This method, however, often 
fails to obtain an exact match when a match variable assumes a large number of possible 
values. To alleviate this, the range of the match variable is split into intervals and the donor 
is obtained by matching on the interval code. In nearest neighbour imputation, this problem 
is solved by selecting a donor based on a multivariate distance measure which represents the 
degree of similarity between the donor and the recipient as defined by the pre-specified match 
variables. The more similar two respondents are with respect to the match variables, the smaller 
the magnitude of the distance. Thus, the best donor for a recipient is the donor candidate 
which has the smallest distance value from the recipient, i.e. its nearest neighbour in the 
sense of statistical distance. 

The nearest neighbour imputation method used in this study was proposed by Sande (1976, 
1981). This method uses the maximum norm based on transformed data as the distance func- 
tion. The method is described briefly below. 

Let $X = (x_1, x_2, x_3, \ldots, x_k)$ be a vector of $k$ match variables. Each match variable $x_j$ 
is transformed by $t_j = F_j(x_j)$, where $F_j(y)$ is the empirical distribution function of $x_j$. Note 
that $t_j$ follows the uniform distribution over [0, 1]. Then the distance between a given 
recipient $X^r$ and a donor candidate $X^d$ defined by the maximum norm is 


$$d(X^r, X^d) = \max_j\,|t_j^r - t_j^d|,$$


where $t_j^r$ and $t_j^d$ are the transformed values of the $j$th match variable $x_j$ in $X^r$ and $X^d$ 
respectively. The donor candidate with the smallest d-value will be selected and its response 
will be copied for the missing item of the recipient. The uniform transformation may be 
considered as an objective method to scale the match variables regardless of their natural 
distributions. 
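
The transformation and distance described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the Numerical Edit/Imputation System used in the study: it assumes NumPy, complete match variables for all records, and an arbitrary tie-breaking rule (the first donor with the minimum distance); the function names are hypothetical.

```python
import numpy as np


def ecdf_transform(column):
    """Empirical distribution function values: share of records <= each value."""
    column = np.asarray(column, dtype=float)
    return np.searchsorted(np.sort(column), column, side="right") / column.size


def nearest_neighbour_donors(match, recipient_idx, donor_idx):
    """For each recipient, return the index of the donor candidate with the
    smallest maximum-norm distance on the transformed match variables."""
    match = np.asarray(match, dtype=float)
    # Transform every match variable (column) to the [0, 1] scale.
    t = np.column_stack([ecdf_transform(match[:, j])
                         for j in range(match.shape[1])])
    donors = np.asarray(donor_idx)
    chosen = []
    for r in recipient_idx:
        # Maximum norm: largest absolute difference over the match variables.
        d = np.max(np.abs(t[donors] - t[r]), axis=1)
        chosen.append(int(donors[np.argmin(d)]))
    return chosen


# Five hypothetical records with two match variables; record 2 is the recipient.
match = [[1, 5], [2, 1], [3, 2], [4, 4], [5, 3]]
donor_for_2 = nearest_neighbour_donors(match, recipient_idx=[2],
                                       donor_idx=[0, 1, 3, 4])
```

The imputed item would then be copied from the selected donor record to the recipient.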


3. EMPIRICAL STUDY 


3.1 Motivation 


In adopting the nearest neighbour imputation method for the NFS, some issues regarding 
detailed implementation of this method need to be resolved, particularly in regards to transfor- 
ming match variables. The method of uniform transformation in the N-N imputation could 
be applied using all the records in the sample or using only subsets of the sample data. A 
group of unit respondents in which imputation for non-response takes place is called an im- 
putation group. Different imputation groups would yield different transformed values which 
in turn would result in different selection of donor records. 
It was conjectured that transforming match variables within an imputation group defined 
by a homogeneity criterion which is closely related to the item to be imputed would result 
in a more correct scaling of the match variables, and hence would yield better imputed data. 
For example, in the NFS one may expect that match variable transformation within imputation 
groups defined by farm type should yield better imputed data and hence better estimates, 
‘better’ being in the sense of bias and variance reduction. Unfortunately, the transformation 
of match variables is costly in terms of computer resources. If one does not need to transform 
within homogeneous imputation groups, savings in computer costs can be realized. 
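
The difference between the two transformations can be seen on a toy example. The sketch below assumes NumPy and uses made-up feed expenses for six farms; the function names are hypothetical and the empirical distribution function is computed as the share of records less than or equal to each value.

```python
import numpy as np


def ecdf(values):
    """Empirical distribution function evaluated at each value."""
    values = np.asarray(values, dtype=float)
    return np.searchsorted(np.sort(values), values, side="right") / values.size


def ecdf_by_group(values, groups):
    """Apply the transformation separately within each imputation group."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(values)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = ecdf(values[mask])
    return out


feed = [0, 5, 10, 200, 300, 400]                      # hypothetical values
farm_type = ["crop"] * 3 + ["livestock"] * 3
whole = ecdf(feed)                      # one scale for the whole sample
by_type = ecdf_by_group(feed, farm_type)   # a separate scale per farm type
```

In this example the crop farm with a feed expense of 10 has a transformed value of 0.5 on the whole-sample scale but 1.0 within its own farm type, so distances to donor candidates, and possibly the selected donor, can change.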

The main objective of the study was to answer the following question in an experimental 
setting: ‘Do the two methods of match variable transformation, i.e., transformation using 
all records vs. within farm type groups, yield substantially different survey estimates? If so, 
which method yields better estimates?’ 


3.2 Data Used in the Study 


After consultation with the subject matter analysts, the 1984 NFS sample for the pro- 
vince of Alberta was selected for the study. The sample of approximately 2000 farms con- 
sists of 50% crop farms, 27% livestock farms and 23% mixed farms. The population 
percentages of the three farm types were estimated to be 52%, 27% and 21% respectively. 
Farm types were assigned according to the main source of projected agricultural receipts 
of a farm. If at least 75% of a farm’s projected agricultural receipts came from its livestock 
inventory, the farm was classified as a livestock farm. A similar rule was used to classify 
crop farms. The remaining farms were classified as mixed farms. 


3.3 Method of the Study 


We assumed that the data was ‘clean’, even though it contained imputed values via the 
sequential hot-deck imputation procedure. Once the data had been classified by farm type, 
the following procedure was followed: 


i) Ten per cent of the values for each imputation variable were randomly set to a missing 
value within each farm type. This error generation was done independently for each 
imputation variable. 

ii) The generated non-responses were imputed using the N-N imputation method based on 
the two sets of imputation groups defined by the whole sample (called ‘whole’) and by 
farm type (called ‘by-type’). The imputation procedures were carried out using the 
Numerical Edit/Imputation System (Statistics Canada 1982), as implemented within the 
P-STAT statistical package (Buhler and Buhler 1978). 

iii) The NFS weighted estimates for the variable totals for the province and for each farm 
type were produced based on each set of imputed data. 

iv) These steps were repeated 10 times to get 10 independent replications (i.e., simulations), 
and the results were averaged over the ten replications for each imputation variable. This 
average estimate was then compared with the estimate obtained based on the ‘clean’ file, 
both at the provincial level and for each farm type. 


The whole experiment was repeated for higher non-response rates of 15% and 20% in 
order to observe the impact of nonresponse rates. 
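
Steps i) and iv) of the procedure can be sketched as follows. This is a simplified illustration assuming NumPy: the imputation and weighting steps are omitted, and the function names and the seed are assumptions introduced here rather than details of the original study.

```python
import numpy as np


def set_missing_by_group(values, groups, rate, rng):
    """Set a fixed share of values to NaN at random within each group (step i)."""
    values = np.asarray(values, dtype=float).copy()
    groups = np.asarray(groups)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        k = int(round(rate * idx.size))
        # Draw k distinct records of this group and blank them out.
        values[rng.choice(idx, size=k, replace=False)] = np.nan
    return values


def percent_bias(replicate_totals, clean_total):
    """Bias of the average replicate estimate, as a percentage of the clean
    estimate (step iv)."""
    return 100.0 * (np.mean(replicate_totals) - clean_total) / clean_total


rng = np.random.default_rng(0)
values = np.arange(40, dtype=float)
groups = np.array(["crop"] * 20 + ["livestock"] * 20)
with_gaps = set_missing_by_group(values, groups, rate=0.10, rng=rng)
```

Repeating the call with an independent draw for each imputation variable and each replication reproduces the independence described in step i).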
The imputation and match variables used in the study are shown below: 


Imputation Variables 


UTIL = Utility expenses 
AUTO = Farm vehicle and machinery operating expenses 
TAX = Property tax 

Match Variables 


Farm type (exact matching) 
FEED = Feed expense 
SEED = Seed expense 
INCOME = Gross agricultural receipts 


In addition, the donor’s sample type was restricted by the recipient’s. Recall that three 
types of samples are used in the NFS: specified, list, and area. A specified farm can be im- 
puted by a farm from any of the sample types but can not be a donor to a list or area farm. 
Similarly, a farm from the list sample can be imputed from a farm in either the list or area 
samples but can only be a donor to farms that are in the list sample or are specified. Finally, 
farms in the area sample can only be imputed by another area farm but can serve as a donor 
to any of the three samples. These restrictions arise from the premise that if a list or specified 
farm was allowed to impute for an area farm, the imputed value could potentially raise the 
survey estimates to an unacceptable level because of the higher sampling weights associated 
with area farms. 
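
The donor restrictions by sample type amount to a small lookup table. The sketch below encodes the rules as stated in the paragraph above; the names `ALLOWED_DONORS` and `is_eligible_donor` are hypothetical.

```python
# For each recipient sample type, the sample types a donor may come from.
ALLOWED_DONORS = {
    "specified": {"specified", "list", "area"},
    "list": {"list", "area"},
    "area": {"area"},
}


def is_eligible_donor(recipient_type, donor_type):
    """True if a farm of donor_type may donate to a farm of recipient_type."""
    return donor_type in ALLOWED_DONORS[recipient_type]
```

A donor search would apply this check, together with the exact match on farm type, before computing distances.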


3.4 The Empirical Distribution Functions of the Match Variables 


Figure 1 shows the unweighted empirical distribution functions of the three match variables 
which are obtained from the imputation groups defined by the whole sample and by farm 
type. Note that the differences are substantial and hence could lead to the selection of dif- 
ferent donor records for a given recipient. 


3.5 Results 


The results are tabulated in Table 1. For each imputation variable (UTIL, AUTO or TAX), 
each of the two sets of imputation groups (whole vs. by-type), and each level of non-response 
rate (10%, 15% or 20%), the average value of the ten estimates for the variable total was 
calculated over the ten replications. The bias of this average value is displayed as a percen- 
tage of the ‘‘clean’’ estimate. The average cv over the ten replicates is also displayed as a 
percentage. 


4. OBSERVATIONS AND DISCUSSION 


This study imputed for three farming expense variables. The donor records were selected 
by exact matching on farm type and by nearest-neighbour matching on three variables: gross 
agricultural receipts, feed expense and seed expense. The two expense match variables were 
believed to be of different effectiveness for the three farm types. For example, feed expense 
was expected to work better for livestock farms but not so for crop farms, etc. The strength 
of correlation between the match variables and the imputation variables presented in Table 2 
seems to support this expectation. 


Therefore the homogeneous subsets based on type of farm have differing relationships 
for the match variables. This might imply that transformations using imputation groups 
defined by these subsets would perform better than using the entire sample as an imputation 
group. The results, however, indicate that using these homogeneous subsets as imputation 
groups does not seem to yield substantially different estimates or lower bias. The bias itself 
Figure 1: Empirical Distribution Functions of Match Variables 
Table 1 

Percentage Bias and cv’s for the Totals of the Imputation 
Variables after Imputation 

Cells marked [ill.] are illegible in the source. 

Non-response  Imputation         UTIL                AUTO                TAX 
rate          group         % Bias   % cv       % Bias   % cv       % Bias   % cv 

All Farms in Sample 
clean                                3.137               2.831               3.224 
10%           by-type        0.176   3.165      -0.004   2.849       0.228   3.260 
              whole          0.124   3.143      -0.074   2.840       0.199   3.296 
15%           by-type        0.339   3.195       0.604   2.885       0.255   [ill.] 
              whole          0.336   3.131       0.278   2.870      -0.624   3.289 
20%           by-type        0.869   [ill.]      0.023   2.875      -0.715   3.280 
              whole          0.554   3.111      -0.150   2.843      -0.877   3.285 

Crop Farms 
clean                                4.829               4.092               4.536 
10%           by-type        0.023   4.872       0.516   4.159       0.200   4.574 
              whole         -0.221   4.829       0.328   [ill.]      0.371   4.625 
15%           by-type        0.468   4.981       0.611   4.200       0.855   4.695 
              whole          0.156   4.863      -0.199   4.231      -0.026   4.672 
20%           by-type        0.402   5.008       0.620   4.238      -1.201   4.770 
              whole         -0.170   4.944       0.129   4.227      -1.158   4.699 

Livestock Farms 
clean                                6.770               5.596               9.527 
10%           by-type        0.125   6.798      -0.885   [ill.]      0.688   9.471 
              whole          0.687   6.800      -0.487   [ill.]     -0.093   9.515 
15%           by-type        0.234   6.829       0.156   [ill.]      0.346   9.325 
              whole          0.789   6.797       0.646   5.533      -1.666   [ill.] 
20%           by-type        [ill.]  6.920      -0.370   5.538       0.654   9.250 
              whole          1.136   6.830      -0.051   5.495      -0.354   9.565 

Mixed Farms 
clean                                7.433               7.190               6.993 
10%           by-type        0.570   [ill.]     -0.549   [ill.]     -0.092   7.029 
              whole          0.093   [ill.]     -0.715   [ill.]     -0.009   [ill.] 
15%           by-type        0.219   7.404       0.957   7.150      -1.437   7.143 
              whole          0.115   7.407       [ill.]  7.107      -1.335   [ill.] 
20%           by-type        0.984   7.541      -1.108   6.984      -0.599   7.010 
              whole          1.303   [ill.]     -0.927   7.001      -0.576   7.050 
Table 2 

Correlation Coefficients between Match and 
Imputation Variables 

                                  Match variables 
Farm         Imputation 
type         variable        FEED     SEED     INCOME 

whole        UTIL            0.46     0.39     0.50 
             AUTO            0.34     0.18     0.50 
             TAX             0.10     0.16     0.27 

crop         UTIL            0.13     0.57     0.69 
             AUTO            0.25     0.28     0.65 
             TAX             0.18     0.19     0.48 

livestock    UTIL            0.64     0.25     0.51 
             AUTO            0.41     0.47     0.52 
             TAX             0.13     0.25     0.28 

mixed        UTIL            0.55     0.49     0.76 
             AUTO            0.48     0.46     0.73 
             TAX             0.24     0.45     0.55 

Note: The coefficients are based on unweighted data from the 1984 NFS core sample in Alberta. 


seems negligible at low rates of non-response. As the non-response rate rises, the bias grows 
but is still not substantial. Except for the variable TAX, the differences between the estimates 
seldom exceed the 95% confidence limits. In the case of TAX, statistical significance, when 
detected, is usually at the 15% and 20% non-response rates. Unfortunately, the average 
estimates for the variables UTIL and TAX do show a pattern of consistent, positive bias. 
No explanation is obvious for this observation and further investigation is warranted to un- 
cover the potential source of bias. 


Thus, there is no need to transform match variables by imputation groups defined by farm 
type for the imputation studied; transforming match variables using the whole sample leads 
to very similar survey estimates. This may not be the case for other imputation rules and 
patterns of non-response that are not random. These are topics for future studies. Although
the imputed estimates compare well with the clean estimates in practical terms, there
may still be some unknown sources of bias. These sources, if they exist, may be related to
this imputation method, to the imputation rule examined in this study or some other uniden- 
tified factor. It is suggested that the presence of bias be confirmed and if confirmed, its source 
determined. Further study is recommended to this end as well as to aid in determining future 
imputation rules for the National Farm Survey. 
Estimating a Monthly Index Based 
on Trimestrial Data 


JOHN G. KOVAR¹ 


ABSTRACT 


A problem of estimating monthly movements in rents based on data collected every four months is ex- 
plored. Five alternative composite estimators of the rent index are presented and justified, both from 
an intuitive as well as theoretical point of view. An empirical study testing and comparing the proposed 
methods is described and summarized. Recommendations are put forth. 


KEY WORDS: Index numbers; Rotating samples; Composite estimation. 


1. INTRODUCTION 


The rent component of the Consumer Price Index is based on data collected on a six month 
rotating basis using a Labour Force Survey Supplement. Since changes in rents generally occur 
on an annual basis, the effective sample size of the Labour Force Survey design is reduced. 
Furthermore, special annual benchmarks, which are obtained by revisiting the June sample 
of dwellings one year later, indicate that the rent component can suffer from varying degrees 
of bias (Dolson 1982). To ameliorate the situation, several data collecting schemes were pro- 
posed in order to combine the monthly data with the yearly benchmarks in a continuous and 
timely fashion. One of these methods, which collects data every four months, was selected 
for practical application. 

The proposed design consists of four sets of four rotation groups of rented dwellings, each 
set of which is to be surveyed in one of four consecutive months, on a rotating basis. Each 
month, one rotation group is surveyed for the first time and the other three are those that 
rotated in four, eight and twelve months ago respectively. Each group would thus be surveyed 
four times over a period of thirteen months, before rotating out of the sample. Every month, 
data on current rents, as well as matched rents collected four months ago, are available from 
exactly three rotation groups (the fourth group is new and thus has no matching ‘‘backrents’’). 
Yearly benchmarks can be calculated monthly based on one rotation group. This paper discusses 
several methods of estimating a monthly index based on such trimestrial data. 
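
The rotation pattern just described can be sketched in a few lines of Python. This is an illustration only: identifying a rotation group by the month in which it enters the sample, and the function names used, are assumptions made for the example and not part of the survey specification.

```python
def survey_months(entry_month):
    """Months in which a group entering at `entry_month` is surveyed."""
    return [entry_month + 4 * k for k in range(4)]


def groups_surveyed(month):
    """Entry months of the four rotation groups surveyed in `month`."""
    return [month - 4 * k for k in range(4)]


def matched_groups(month):
    """Groups that also reported four months ago (all but the new group)."""
    return [g for g in groups_surveyed(month) if g != month]


def benchmark_group(month):
    """The single group that also reported twelve months ago."""
    return month - 12


if __name__ == "__main__":
    print(survey_months(1))     # four visits spread over thirteen months
    print(groups_surveyed(13))  # the new group plus three returning groups
    print(matched_groups(13))   # the three groups with matching back rents
    print(benchmark_group(13))  # the group supplying the yearly benchmark
```

Under this convention a group that enters in month 1 is seen in months 1, 5, 9 and 13, which is the "four times over a period of thirteen months" of the text.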

In estimating the indices, the constraints of the Consumer Price Index publication policy 
must be kept in mind. In other words, it must be practically as well as technically possible 
to produce the indices on a monthly basis for each of the index cities. The estimates must 
be timely: produced no later than mid-month following the reference month. Furthermore, 
no revisions can be made once the indices are published. While not entirely essential, it would 
be desirable that any proposed estimator be able to reflect (real) sudden changes in trend very 
quickly. On the other hand, in order to remain credible, the indices must be relatively stable: 
volatile, saw-toothed indices are to be avoided. 


¹ John G. Kovar, Business Survey Methods Division, Statistics Canada, 11th floor, R.H. Coats Building, Tunney’s 
Pasture, Ottawa, Ontario K1A 0T6. 
In Section 2, five estimators will be presented, justified, and compared on a theoretical 
basis. Some empirical adjustments to these indices will be discussed in Section 3. In order 
to compare the performance of these estimators over time and between locations, a simula- 
tion study involving eight cities with observations over a period of 48 months was performed. 
The results of the study are presented in Section 4. The conclusions and recommendations 
can be found in Section 5. 


2. INDEX ESTIMATORS 


In this paper, only matched indices will be considered. While relative changes could easily 
be derived by comparing independent (unmatched) estimates of rent levels at distinct time 
points, such estimates of levels would have to be very reliable, necessitating prohibitively 
large sample sizes. Moreover, past studies indicate that such direct estimators tend to be 
volatile, upwardly biased and generally not practical in use (Szulc 1983). In what follows, 
therefore, an estimate of relative change between two time points will be based only on those 
units that report rents for both of these time points. 

We will denote by $x_m$ the total rent paid, in the current month $m$, by a certain subset $s$ 
of dwellings in a given city. Thus, more rigorously, 


$$x_m = \sum_{i \in s} x_{mi}, \qquad (2.1)$$


where $x_{mi}$ denotes the rent paid by the $i$-th dwelling in month $m$. The rent index is customarily 
estimated by chaining one-month relatives, that is, the ratios of average rents between 
two consecutive months, denoted by $r^m_{m-1}$. In other words, the index in month $m$, $I_m$, over 
a base period zero, is estimated recursively by 


$$I_m = I_{m-1} \times r^m_{m-1} = 100 \times r^1_0 \times r^2_1 \times \cdots \times r^m_{m-1}, \qquad (2.2)$$


where 100 is the (arbitrary) level of the index at time zero. The difficulty then rests only 
in estimating the relatives. 
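
As an illustration of the chaining in (2.2), the following Python sketch builds the index series from a list of one-month relatives. The function name `chain_index` and the sample relatives are hypothetical; only the recursion itself is taken from the text.

```python
def chain_index(relatives, base=100.0):
    """Return [I_0, I_1, ..., I_m] with I_0 = base and I_k = I_{k-1} * r_k."""
    index = [base]
    for r in relatives:
        index.append(index[-1] * r)
    return index


if __name__ == "__main__":
    # Three made-up monthly relatives: +1%, +2%, -1%.
    print(chain_index([1.01, 1.02, 0.99]))
```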

In general, consider the relative change in rent in month $m$ over month $l$, denoted by 
$r^m_l$. This ‘‘$m$ over $l$ relative’’ can be estimated by 


$$r^m_l = x_m / x_l. \qquad (2.3)$$


However, if one considers matched indices only, the only estimable relatives under the 
proposed design are the four-month relatives, in other words, those of the form $r^m_{m-4j}$, 
$j = 1, 2, 3$, because it is only in these cases that there are common units between the two 
months. These relatives are estimated by 


$$r^m_{m-4j} = x_m / x_{m-4j}, \qquad (2.4)$$


where the set $s$ of dwellings consists of only those units that report rents at both time $m$ 
and $m-4j$. Unfortunately, the interest lies in estimating monthly relatives of the form $r^m_{m-1}$. 
On the positive side, the rotation scheme ensures that a four-month relative is available every 
month. It is also assumed that units rotating out of the sample are replaced by equivalent 
units rotating into the sample. As such, the set $s$ of common dwellings in (2.1) depends on 
the time $m$ only and any future reference to it, while implicitly retained, can thus be suppressed 
in what follows. For a rigorous discussion of these assumptions and the effect on 
the index if the assumptions fail, the reader is invited to consult Szulc (1983) and Kovar (1984). 

In the following paragraphs, five methods of estimating monthly relatives from four-month 
relatives will be described. Each will be justified intuitively as well as theoretically, and its 
advantages and disadvantages will be pointed out. The first three methods are derived on 
a theoretical basis alone while the fourth attempts to exploit the rotation pattern of the survey. 
All four assume that at least a four month back history of data is available. The last approach 
takes advantage of prior empirical knowledge: that of high probability of observing one change 
in rent per year. Methods two and four have been discussed earlier by Kovar (1984). 


2.1 Interpolated Index (Additive Index) 


One way of estimating the relative $r^m_{m-1}$ is to estimate the previous month’s rent, $\hat{x}_{m-1}$. 
This can be accomplished, among other methods, by linearly interpolating the observed rents 
at time $m$ and $m-4$, that is, by assuming that the rents increase (decrease) linearly over 
time. Note that this assumption does not require each individual rent to increase every month 
by a fixed amount, but merely that the sum of all the rents does. In general, to describe 
linear interpolation briefly, consider two measurements of the same quantity at two distinct 
time points, say $y_t$ and $y_{t-s}$. Suppose that we wish to estimate the value of $y$ at some point 
between the times $t-s$ and $t$, say at time $t-u$ ($u < s$). Assuming that the measurements 
increase linearly in time, $y_{t-u}$ can be estimated from $y_t$ and $y_{t-s}$ by 


$$y_{t-u} = \frac{u}{s}\, y_{t-s} + \left(1 - \frac{u}{s}\right) y_t, \qquad (2.5)$$

or in the case at hand, where $s = 4$ and $u = 1$, by 

$$y_{t-1} = \tfrac{1}{4}\, y_{t-4} + \tfrac{3}{4}\, y_t. \qquad (2.6)$$

Thus the previous month’s total rent can be estimated by 

$$\hat{x}_{m-1} = \tfrac{1}{4}\, x_{m-4} + \tfrac{3}{4}\, x_m, \qquad (2.7)$$

and consequently, the monthly relative for month $m$ by 

$$r^m_{m-1} = \frac{x_m}{\hat{x}_{m-1}} = \frac{4x_m}{x_{m-4} + 3x_m}. \qquad (2.8)$$



The index is then derived by chaining the relatives as in (2.2) above. 
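
A small numerical sketch of the interpolated relative may help. The function names and the rent totals (100 four months ago, 104 now) are invented for the illustration; the arithmetic follows the linear interpolation described above.

```python
def interpolate(y_t, y_t_minus_s, u, s):
    """Linearly interpolate the value at time t - u from values at t - s and t."""
    w = u / s
    return w * y_t_minus_s + (1.0 - w) * y_t


def interpolated_relative(x_m, x_m_minus_4):
    """One-month relative based on the interpolated previous month's rent."""
    previous = interpolate(x_m, x_m_minus_4, u=1, s=4)
    return x_m / previous


if __name__ == "__main__":
    print(interpolate(104.0, 100.0, u=1, s=4))   # estimated rent one month back
    print(interpolated_relative(104.0, 100.0))   # equals 4*104 / (100 + 3*104)
```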

Provided that the rents follow the linear interpolation model, that is, provided that we 
can write the current month’s rent as a recursive function of previous months’ rents, namely, 
as 


$$x_m = \tfrac{4}{3}\, x_{m-1} - \tfrac{1}{3}\, x_{m-4}, \qquad (2.9)$$


then it can be shown that the index at time $m$ is given by $I_m = x_m/x_0$, as is desired. In other 
words, if the data follow the model in (2.9), the index will suffer no time lags. But, of course, 
if the model were true at all times, the index would be fixed for all time points, based on 
any two observations. Since this is clearly not the case, one can at best use (2.8) as an 
approximation over short periods of time only. In that case, however, if the relationship 
in (2.9) is not exact, the index at time $m$ will depend on all the rents between time $-4$ and 
$m$. In other words, the index is then susceptible to accumulating various biases over time. 

Note that the same index would be derived by assuming that the four-month increment, 
$x_m - x_{m-4}$, occurred in 4 equal additive steps: $(x_m - x_{m-4})/4$. The previous 
month’s rent would then be estimated by 


$$\hat{x}_{m-1} = x_m - (x_m - x_{m-4})/4, \qquad (2.10)$$

which is the same as (2.7); hence the alias: additive index. 


2.2 Geometric Index 


In this section, in contrast to the above, we will attempt to estimate the relative directly. 
We first note that 


$$r^m_{m-4} = \frac{x_m}{x_{m-4}} = \frac{x_m}{x_{m-1}} \cdot \frac{x_{m-1}}{x_{m-2}} \cdot \frac{x_{m-2}}{x_{m-3}} \cdot \frac{x_{m-3}}{x_{m-4}} = r^m_{m-1}\, r^{m-1}_{m-2}\, r^{m-2}_{m-3}\, r^{m-3}_{m-4}. \qquad (2.11)$$


We then assume that the four relatives on the right hand side of (2.11) are equal, or 
equivalently, that the four-month movement is due to four equal movements which act 
multiplicatively (Kosary et al. 1982). Under this assumption, the relationship (2.11) can be 
written as 


$$r^m_{m-1} = \left(r^m_{m-4}\right)^{1/4} = \left(\frac{x_m}{x_{m-4}}\right)^{1/4}. \qquad (2.12)$$


From (2.2) and (2.3), assuming that there are no sample changes or that units rotating 
out of the sample are replaced by equivalent units rotating into the sample, the index in month 
m over the base period zero becomes 


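$$\begin{aligned}
I_m &= I_0 \times r^1_0 \times r^2_1 \times \cdots \times r^m_{m-1} \\
    &= I_0 \times \left(r^1_{-3}\right)^{1/4} \times \left(r^2_{-2}\right)^{1/4} \times \cdots \times \left(r^m_{m-4}\right)^{1/4} \\
    &= I_0 \times \left(\frac{x_{m-3}\, x_{m-2}\, x_{m-1}\, x_m}{x_{-3}\, x_{-2}\, x_{-1}\, x_0}\right)^{1/4}. \qquad (2.13)
\end{aligned}$$
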

In other words, the index is a ratio of two geometric averages; hence the name geometric 
index. We note that at any time, assuming the panels are stationary, the index depends on 
eight months’ worth of data only, and thus is independent of any movements between time 
0 and $m-4$, though in practice the matched sets contributing to each $r^m_{m-4}$ are different, so the 
cancellation is only theoretical. On the other hand, the index suffers from one-month to three-month 
lags and will thus tend to dampen true sudden changes. These changes, however, will be 
reflected eventually, that is, the index will self-correct (Kovar 1984). 
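
The telescoping behind the geometric index can be checked numerically. The sketch below assumes, as the text does, a stationary panel, so that a single series of monthly totals (indexed from month -3) feeds every four-month relative; the series itself and the function names are made up for the illustration.

```python
import math


def geometric_relative(x_m, x_m_minus_4):
    """Monthly relative taken as the fourth root of the four-month relative."""
    return (x_m / x_m_minus_4) ** 0.25


def chained_geometric_index(x, m, base=100.0):
    """Chain the monthly geometric relatives from month 1 to month m."""
    index = base
    for k in range(1, m + 1):
        index *= geometric_relative(x[k], x[k - 4])
    return index


def closed_form_geometric_index(x, m, base=100.0):
    """Ratio of geometric means of the last four and the first four totals."""
    numerator = math.prod(x[k] for k in range(m - 3, m + 1))
    denominator = math.prod(x[k] for k in range(-3, 1))
    return base * (numerator / denominator) ** 0.25


if __name__ == "__main__":
    # Hypothetical totals with a small jump every twelfth month.
    x = {k: 1000.0 + 6.0 * k + (5.0 if k % 12 == 6 else 0.0) for k in range(-3, 25)}
    for m in (1, 4, 12, 24):
        print(m, chained_geometric_index(x, m), closed_form_geometric_index(x, m))
```

The two columns agree up to rounding, which is the sense in which the index depends on eight months of data only when the panels are stationary.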
As a point of clarification, note also that the relatives in (2.12) can be rewritten as 


$$\frac{x_m}{\hat{x}_{m-1}} = \left(\frac{x_m}{x_{m-4}}\right)^{1/4},$$

or as 

$$\hat{x}_{m-1} = \left(x_{m-4}\right)^{1/4} \left(x_m\right)^{3/4},$$

or finally as 


$$\log \hat{x}_{m-1} = \tfrac{1}{4} \log x_{m-4} + \tfrac{3}{4} \log x_m. \qquad (2.14)$$


The geometric index is therefore equivalent to an index derived by estimating the previous 
month’s rent by linearly interpolating the logarithms of the observed rents at time $m$ and 
$m-4$. (See (2.6) with $y_m = \log x_m$.) 


2.3 Incremental Index 


Analogous to the above geometric index, here we assume that the four consecutive monthly 
relative net increments are equal and act additively. More precisely, we can write $r^m_l$ as 


$$r^m_l = 1 + i^m_l,$$


where $i^m_l$ is the relative net increment in month $m$ over month $l$. To estimate $r^m_{m-1}$ we 
therefore need $i^m_{m-1}$. Assuming that the available $i^m_{m-4} = 4\, i^m_{m-1}$, the relative $r^m_{m-1}$ can be 
estimated. Namely, we will estimate $i^m_{m-1}$ by 


$$i^m_{m-1} = \tfrac{1}{4}\, i^m_{m-4} = \tfrac{1}{4} \left(r^m_{m-4} - 1\right) = \tfrac{1}{4} \left(\frac{x_m}{x_{m-4}} - 1\right), \qquad (2.15)$$


and $r^m_{m-1}$ by 


$$r^m_{m-1} = 1 + i^m_{m-1} = \frac{x_m + 3x_{m-4}}{4x_{m-4}}. \qquad (2.16)$$

We note that $r^m_{m-1} = x_m/\hat{x}_{m-1}$ and thus (2.16) can be written as 

$$\frac{x_m}{\hat{x}_{m-1}} = \frac{x_m + 3x_{m-4}}{4x_{m-4}},$$

or as 

$$\frac{1}{\hat{x}_{m-1}} = \tfrac{1}{4} \cdot \frac{1}{x_{m-4}} + \tfrac{3}{4} \cdot \frac{1}{x_m}. \qquad (2.17)$$


In other words, the incremental index corresponds to one which would be derived by 
estimating the previous month’s rent by linearly interpolating the reciprocals of the observed 
rents at time $m$ and $m-4$. (See (2.6) with $y_m = x_m^{-1}$.) 
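
The equivalence between the incremental relative and interpolation of reciprocals can be verified with a short Python sketch. The function names and the totals used are hypothetical; the increment is taken as one quarter of the four-month relative minus one, as in the derivation above.

```python
def incremental_relative(x_m, x_m_minus_4):
    """One plus a quarter of the four-month relative net increment."""
    increment = 0.25 * (x_m / x_m_minus_4 - 1.0)
    return 1.0 + increment


def implied_previous_rent(x_m, x_m_minus_4):
    """Previous month's rent obtained by interpolating the reciprocals."""
    return 1.0 / (0.25 / x_m_minus_4 + 0.75 / x_m)


if __name__ == "__main__":
    r = incremental_relative(104.0, 100.0)
    x_prev = implied_previous_rent(104.0, 100.0)
    print(r, 104.0 / x_prev)   # the two values coincide up to rounding
```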
As is the case with the interpolated index, the incremental index will be independent of 
the intermediate observations only under the restrictive condition that the interpolation model 
be followed. In this case, analogous to (2.9), the model is 


$$\frac{1}{x_m} = \tfrac{4}{3} \cdot \frac{1}{x_{m-1}} - \tfrac{1}{3} \cdot \frac{1}{x_{m-4}}. \qquad (2.18)$$


However, in most real situations, the chained incremental index will depend on all the 
data between times $-4$ and $m$ and therefore will be susceptible to various accumulating biases. 

Since all three indices discussed to this point can be described in terms of linear interpolation 
of various functions of the observed rents, it is also possible to compare them theoretically. 
It can in fact be shown that the three indices are ordered in magnitude, from smallest to 
largest in the order of their presentation. That is, in an inflationary situation the interpolated 
index will always be smaller in absolute value than the geometric index which in turn will 
always be dominated by the incremental index. The reverse holds true when the trend is 
downward, that is, when prices are decreasing. As one referee pointed out, this phenomenon 
can be explained by noting that ‘‘the interpolated, geometric and incremental relatives are 
respectively the weighted arithmetic, geometric, and harmonic means of rent quotations four 
months apart. The standard relationship between these means explains the behaviour of the 
estimates in inflationary or deflationary times’’. 
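
The ordering can be illustrated numerically. In the sketch below (function name and rent totals are invented), each relative is written as the current total divided by a weighted arithmetic, geometric or harmonic mean of the two totals observed four months apart, with weights 1/4 and 3/4. On one reading of the statement above, the relatives keep the same order in both directions, while the size of the movement, $|r - 1|$, reverses when rents fall.

```python
def three_relatives(x_m, x_m_minus_4):
    """Interpolated, geometric and incremental one-month relatives."""
    arithmetic_mean = 0.25 * x_m_minus_4 + 0.75 * x_m
    geometric_mean = x_m_minus_4 ** 0.25 * x_m ** 0.75
    harmonic_mean = 1.0 / (0.25 / x_m_minus_4 + 0.75 / x_m)
    return (x_m / arithmetic_mean, x_m / geometric_mean, x_m / harmonic_mean)


if __name__ == "__main__":
    rising = three_relatives(104.0, 100.0)    # rents went up over four months
    falling = three_relatives(100.0, 104.0)   # rents went down over four months
    print("rising: ", rising)
    print("falling:", falling)
    print("movement sizes (rising): ", [abs(r - 1.0) for r in rising])
    print("movement sizes (falling):", [abs(r - 1.0) for r in falling])
```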


2.4 Carried Index (Arithmetic Index) 


The carried index is constructed by taking advantage of the rotating sample at hand. Noting 
that all units reappear periodically in the sample, we construct the index by simply carrying 
each unit’s rent value forward until a new observation is recorded. In this way all units on 
the file have a matching previous month’s rent and thus the monthly relative, $r^m_{m-1}$, can be 
constructed in a straightforward manner. The obvious drawback is that the rent increases 
(decreases) are not recorded until observed. However, since all changes are eventually recorded, 
the index will self-correct (Kovar 1984) but will suffer from a mixture of one to three-month 
lags. Just as for the geometric index, sudden (real) changes will be dampened but the carried 
index will reflect them eventually. 

On the technical side, we note that in computing the carried index for any given month 
one quarter of the observations on the file reflect a four-month movement, whereas three 
quarters of the observations are carried for one to three months and reflect no change. In 
fact, in month $m$ we observe $x_m$ and carry $x_{m-1}$, $x_{m-2}$ and $x_{m-3}$. Similarly, in month $m-1$ 
we observe $x_{m-1}$ and carry $x_{m-2}$, $x_{m-3}$ and $x_{m-4}$. The monthly relative is therefore given by 


$$r^m_{m-1} = \frac{x_m + x_{m-1} + x_{m-2} + x_{m-3}}{x_{m-1} + x_{m-2} + x_{m-3} + x_{m-4}}. \qquad (2.19)$$


Chaining the relatives as in (2.2), and assuming again that the samples are stationary, 
we obtain the index for month $m$ over the base period zero as 


$$I_m = I_0 \times \frac{x_{m-3} + x_{m-2} + x_{m-1} + x_m}{x_{-3} + x_{-2} + x_{-1} + x_0}. \qquad (2.20)$$



In other words, the index is a ratio of two arithmetic averages. Analogous to the geometric 
index, the carried index depends on eight months worth of data only, and thus is independent 
of the movements between time 0 and m — 4. As mentioned above, it too suffers from one 
to three-month lags, and therefore dampens sudden changes. 
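
The carried index can be sketched in the same way as the geometric one. Here `x[k]` stands for the total rent observed in month `k` by the group surveyed that month; the stationarity assumption of the text is taken for granted, and the series and function names are hypothetical.

```python
def carried_relative(x, m):
    """Current observation plus three carried totals over last month's four."""
    numerator = sum(x[k] for k in range(m - 3, m + 1))
    denominator = sum(x[k] for k in range(m - 4, m))
    return numerator / denominator


def chained_carried_index(x, m, base=100.0):
    """Chain the carried relatives from month 1 to month m."""
    index = base
    for k in range(1, m + 1):
        index *= carried_relative(x, k)
    return index


def closed_form_carried_index(x, m, base=100.0):
    """Ratio of the sums of the last four and of the first four totals."""
    return base * sum(x[k] for k in range(m - 3, m + 1)) / sum(x[k] for k in range(-3, 1))


if __name__ == "__main__":
    x = {k: 1000.0 + 6.0 * k for k in range(-4, 25)}   # made-up monthly totals
    for m in (1, 6, 24):
        print(m, chained_carried_index(x, m), closed_form_carried_index(x, m))
```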
2.5 Annual Index 


Empirical observations suggest that most units change rent once a year. One could therefore 
argue that yearly relatives are more stable than monthly relatives, since the distribution of 
individual monthly relatives will necessarily demonstrate two spikes, one around the annual 
relative and the other at 1. The rotation pattern of the proposed rent pilot (Kovar 1984) ensures 
that an annual relative is estimable every month, that is, that $r^m_{m-12}$ is available. To 
compute the annual index on a monthly basis, we note that for any chained index the following 
relationships hold: 


$$I_m / I_{m-1} = r^m_{m-1} \qquad (2.21)$$

and 


$$I_m / I_{m-12} = r^m_{m-12}. \qquad (2.22)$$


From these relationships we obtain an expression for a monthly relative $r^m_{m-1}$ as 


$$r^m_{m-1} = r^m_{m-12}\, I_{m-12} / I_{m-1}. \qquad (2.23)$$


These relatives can then be chained as above to produce an index. Since such a relationship 
is recursive, we need 12 months’ worth of indices to be able to ‘‘start up’’. One possibility 
is to define the index for the first 12 months, by analogy to the geometric 
index, as 


$$I_m = \left(r^m_{m-12}\right)^{1/12} I_{m-1}, \qquad m = 1, 2, \ldots, 12. \qquad (2.24)$$


As defined, the annual index is independent of intermediate changes. On the other hand 
it will be saw-toothed unless individual monthly sample sizes are large. This is due to the 
fact that consecutive monthly estimates are totally independent. Moreover, it must be noted 
that the lagging problem will be at least as serious in the case at hand as it is for the indices 
presented earlier. 
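
The recursion behind the annual index can be sketched as follows. The start-up values are supplied by the caller (how they are obtained is left open here), and the function name, the argument layout and the constant-growth test series are assumptions made for this illustration.

```python
def annual_index(startup, annual_relatives):
    """Extend a start-up index series using annual relatives.

    startup: index values I_0, I_1, ... (at least twelve of them).
    annual_relatives: relatives of month m over month m - 12, one for each
        month that follows the start-up period, in chronological order.
    """
    if len(startup) < 12:
        raise ValueError("at least twelve start-up index values are needed")
    index = list(startup)
    for r_annual in annual_relatives:
        m = len(index)
        # Monthly relative as in (2.23), then chained as in (2.2).
        monthly = r_annual * index[m - 12] / index[m - 1]
        index.append(index[m - 1] * monthly)
    return index


if __name__ == "__main__":
    # Hypothetical case of steady 0.5% monthly growth.
    startup = [100.0 * 1.005 ** k for k in range(12)]
    series = annual_index(startup, [1.005 ** 12] * 6)
    print([round(v, 3) for v in series[-6:]])
```

Each new value equals the annual relative times the index twelve months earlier, so sampling noise in one month's annual relative passes directly into that month's index.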


3. ADJUSTMENTS 


In this section, two adjustment procedures for the above indices will be discussed. First, 
because the first four indices suffer from one to three month lags, they will smooth out true, 
sharp peaks. From prior data, it has been observed that rent indices do exhibit sharp rises, 
in certain cities, with some regularity. To ‘‘correct’’ the smoothed out index, an empirical 
adjustment will be proposed. By contrast, due to the volatility of the annual index, a smoothing 
adjustment will also be proposed. 


3.1 Empirical Adjustments 


It is known, for example, that most rents in Montreal change in July. The first four in- 
dices discussed in the previous section would distribute this July change over July, August, 
September and October. One could however adjust the index in July to reflect a larger change 
and counter adjust it in the following three months. More precisely, the index could be 
multiplied by $r^*$ in the reference month and then by $(r^*)^{-1/3}$ in each of the following three 
months. Since all the proposed indices are chained indices, in the third month after the 
reference month the four multipliers will offset each other, leaving no trailing biases. As 
for the choice of $r^*$, this will depend on continued empirical observations in each particular 
city. 
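
The offsetting of the four multipliers can be checked with a short sketch. It applies the adjustment to the chained relatives, which is one way of reading the procedure; the value of the adjustment factor, the relatives and the function names are all hypothetical.

```python
def chain(relatives, base=100.0):
    """Chain monthly relatives into an index series starting at `base`."""
    index = [base]
    for r in relatives:
        index.append(index[-1] * r)
    return index


def adjust_relatives(relatives, reference, r_star):
    """Boost the relative at position `reference` and offset it over the next three."""
    adjusted = list(relatives)
    adjusted[reference] *= r_star
    for k in range(1, 4):
        adjusted[reference + k] *= r_star ** (-1.0 / 3.0)
    return adjusted


if __name__ == "__main__":
    relatives = [1.01] * 8                    # made-up smooth monthly relatives
    plain = chain(relatives)
    adjusted = chain(adjust_relatives(relatives, reference=2, r_star=1.03))
    for month, (a, b) in enumerate(zip(plain, adjusted)):
        print(month, round(a, 3), round(b, 3))
```

From the third month after the reference month onward the adjusted and unadjusted series coincide again.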

It is to be noted that such adjustments must be performed in rare situations only and with 
great care. It is imperative that the particular situation be monitored, for it is not uncommon 
for such aberrations to disappear suddenly. 


3.2 Smoothing 


As a last effort in redeeming a volatile, saw-toothed index, one could consider smoothing 
it. Like the above adjustments, smoothing should be considered in rare and extreme situations 
only: in cases where no other alternative exists. The smoothing procedure we consider here 
involves averaging the index at time m with a linear extrapolation to time m of the smoothed 
index from time m — 1 and m — 2. One possible choice of the smoothed index at time m, 
$S_m$, is then given by 


$$S_m = \frac{I_m + \left(2S_{m-1} - S_{m-2}\right)}{2} = 0.5\, I_m + S_{m-1} - 0.5\, S_{m-2}. \qquad (3.1)$$


Since the smoothing operation basically projects past data into the future, the smoothed 
index will extend past trends and therefore introduce some lags. Moreover, the method is 
recursive and consequently could also introduce unwanted biases. Other smoothing methods 
could be considered, although the utility of smoothing an index that suffers from serious 
lags is questionable. 
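
A minimal sketch of this smoothing rule follows. How the first two values are initialized is not stated in the text; leaving them unsmoothed is an assumption made here, as are the function name and the test series.

```python
def smooth(index):
    """Average each index value with a linear extrapolation of the smoothed series."""
    smoothed = list(index[:2])   # assumption: first two values are left as they are
    for m in range(2, len(index)):
        extrapolated = 2.0 * smoothed[m - 1] - smoothed[m - 2]
        smoothed.append(0.5 * (index[m] + extrapolated))
    return smoothed


if __name__ == "__main__":
    saw_toothed = [100.0, 102.0, 101.0, 104.0, 102.5, 106.0]   # made-up index
    print(smooth(saw_toothed))
```

A perfectly linear index passes through unchanged, since the extrapolation then reproduces the index itself.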


4. EMPIRICAL STUDY 


The study described in the following paragraphs was initiated in order to test the perfor- 
mance over time of the proposed indices and adjustments. The study provides quantitative 
information on the ability of the indices to track the true index accurately. It supports the 
mostly heuristic observations made above and reinforces the theoretical ones. 


4.1 The Population 


The population of rented dwellings used in this study was designed to duplicate the real 
situation as closely as possible. For this purpose, the cities, their sizes, and their sample sizes 
were selected to correspond to those used by the Rent Component of the CPI. Since all real 
data on rents is available for periods of six months only, the needed thirteen months of data 
had to be simulated. Eight cities were chosen for this purpose. Some are large, some are 
small, some have periodic jumps in their indices, but all are CPI index cities and have suffi- 
cient amount of rent data available. Moreover, while some of the indices in these cities are 
strictly increasing, others are both increasing and decreasing. 

Table 1 


Average Sample Sizes (Distinct Units) and the Index at 8401 
for Eight Cities Based on the Simulated Population 


City          Average Monthly Sample Size     Index at 8401 (8001 = 100) 
Halifax       [illegible]                     144.3 
Montreal      268                             136.6 
Ottawa        35                              130.0 
Toronto       170                             130.4 
Winnipeg      105                             132.0 
Edmonton      [illegible]                     125.2 
Calgary       97                              [illegible] 
Vancouver     105                             130.5 


Only the initial rents of all units (those collected when the unit rotated in) on the CPI 
rent database for the years 1979 to 1984 inclusive, for the eight cities mentioned above, were 
retrieved. For each unit, twelve additional months’ worth of data were then simulated using 
the observed parameters. (This approach is operationally easier than simulating seven months 
of data in addition to the existing six.) More precisely, for each unit, first a decision was 
made whether or not a change in rent will occur sometime in the next twelve months. The 
probability of this event was set to be equal to the observed probability of a rent change 
in that particular city and year. Then, given that a change was to occur, the appropriate month 
was selected proportional to the observed incidence of rent changes, again specific to the 
city and month at hand. The actual amount of the rent change was assumed to be distributed 
normally with a fixed mean and variance. Robust estimates of these two parameters were 
obtained from the existing data for each city and each month. 
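
The simulation steps for a single unit can be sketched in Python as follows. The study itself was programmed in SAS; this is not that program. The parameter values are invented, the change is treated as an additive amount (an assumption), and `random.random`, `random.choices` and `random.gauss` merely play the roles of the uniform and normal generators.

```python
import random


def simulate_unit(initial_rent, p_change, month_weights, mean_change, sd_change, rng):
    """Return 13 monthly rents: the observed initial rent plus 12 simulated months."""
    rents = [initial_rent] * 13
    # Step 1: decide whether a change occurs at all in the next twelve months.
    if rng.random() < p_change:
        # Step 2: pick the month of the change in proportion to the weights.
        month = rng.choices(range(1, 13), weights=month_weights, k=1)[0]
        # Step 3: draw the amount of the change from a normal distribution.
        change = rng.gauss(mean_change, sd_change)
        for k in range(month, 13):
            rents[k] = initial_rent + change
    return rents


if __name__ == "__main__":
    rng = random.Random(1984)
    weights = [1.0] * 6 + [6.0] + [1.0] * 5   # hypothetical peak in the seventh month
    print(simulate_unit(450.0, 0.8, weights, 25.0, 8.0, rng))
```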

All programming was done in SAS (Statistical Analysis System). The random numbers 
were generated using the routines RANUNI and RANNOR. The resulting population consists 
of eight cities and four years of fully rotated data (that is, discarding start up months). 
The average monthly sample sizes and the value of the simulated index for January 1984 
(with Jan 1980 = 100) can be seen for each city in Table 1. The indices, calculated for each 
of the cities, resemble very closely those observed originally. In the following comparisons, 
the indices of the simulated population were taken to be the true reference points to be 
reproduced. 


4.2 Comparison of Indices 


For the purpose of calculating the indices, it was assumed that of the 13 available obser- 
vations for each unit, only those for months 1, 5, 9 and 13 were actually observed. All 
calculations were then based on this (4/13) subsample. The five indices described above were 
calculated for each city and compared to the true index. All indices are fixed at 100 in January 
1980. The empirical adjustment was tested with the Montreal, Halifax and Winnipeg data, 
for the months of July, January and October respectively. While the results for all the possible 
combinations of cities and indices are too numerous to include herein, they are available 
from the author. Some selected highlights will be put forth in the following paragraphs. 
While not exhaustive, they are hoped to be representative as well as indicative of the situation 
at hand. 
Figure 1. Plot of the True Index and the Interpolated Index for the City of Ottawa 
Figure 2. Plot of the True Index and the Geometric Index for the City of Ottawa 
Figure 3. Plot of the True Index and the Incremental Index for the City of Ottawa 
Figure 4. Plot of the True Index and the Carried Index for the City of Ottawa 
Figure 5. Plot of the True Index and the Annual Index for the City of Ottawa 
Figure 6. Plot of the True Index and the Annual Index for the City of Toronto 
As can be seen in Figures 1-5, all five indices track the true index reasonably well, even 
in the case of small sample sizes such as in the city of Ottawa. As expected, the first four 
indices show some lags, those being more pronounced in the carried and interpolated indices. 
(Note that the lagging problem could likely be accentuated by generating the population with 
exponentially increasing prices.) Not surprisingly, the annual index is rather volatile. For 
cities with large sample sizes (e.g. Toronto), however, the annual index performs well (see 
Figure 6). While the smoothing adjustment of Section 3.2 does indeed smooth the index, 
the results are less than satisfactory as can be seen in Figure 7 (cf. Figure 5). Perhaps a 
larger number of points should be used for the extrapolation but then the lagging problem 
would be even more pronounced. Figure 8 further demonstrates how sudden unexpected 
changes in trends are reported with a delay. However, expected jumps in the index (as in 
July in Montreal, Figure 9) can be adjusted successfully using the adjustment procedure of 
Section 3.1 (Figure 10). 
Table 2 


Mean Square Errors of Five Indices in Eight Cities 


City          Interpolated        Geometric           Incremental         Carried     Annual 

Halifax       30* (3)             [illegible]* (2)    [illegible]* (1)    48 (4)      74 (5) 
Montreal      48* (3)             24* (2)             [illegible]* (1)    160 (5)     82 (4) 
Ottawa        17 (3)              12 (2)              8 (1)               22 (4)      95 (5) 
Toronto       36 (4)              27 (3)              20 (2)              29 (5)      13 (1) 
Winnipeg      [illegible]* (3)    [illegible]* (2)    10* (1)             66 (5)      41 (4) 
Edmonton      46 (1)              64 (4)              88 (5)              55 (3)      50 (2) 
Calgary       56 (2)              81 (4)              121 (5)             64 (3)      46 (1) 
Vancouver     70 (5)              53 (2)              39 (1)              64 (4)      60 (3) 


Note: 1. Bracketed figures indicate ranking within cities. 
      2. Starred figures are results of adjusted indices as per Section 3.1. 


Mean square errors of the five indices away from the true index have been calculated for 
each city (Table 2). The three interpolation based indices (interpolated, geometric and in- 
cremental) have been adjusted for the cities of Montreal, Halifax and Winnipeg. Table 2 
also presents the rankings (from smallest to largest) of the mean square errors of the five 
indices within each city. The carried and the annual index tend to perform the worst. The 
three interpolation-based indices perform relatively alike. In general, in cities where the index 
is climbing consistently, the performance of these three indices worsens in the order: incre- 
mental, geometric, interpolated. The order is reversed in cities where sharp decreases in the 
index have been observed. It is unlikely, however, that the strategies could be interchanged 
based on observed behaviours only. 
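
The comparison criterion can be written down compactly. In the sketch below the mean square error is the average squared deviation of an estimated index from the true one (any scaling used in the study is not reproduced), and the ranking example uses the legible Ottawa row of Table 2; the function names are hypothetical.

```python
def mean_square_error(estimated, true):
    """Average squared deviation of an estimated index series from the true one."""
    return sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true)


def rank_within_city(mse_by_index):
    """Rank the indices from smallest (1) to largest mean square error."""
    ordered = sorted(mse_by_index, key=mse_by_index.get)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


if __name__ == "__main__":
    ottawa = {"interpolated": 17, "geometric": 12, "incremental": 8,
              "carried": 22, "annual": 95}
    print(rank_within_city(ottawa))
    print(mean_square_error([101.0, 103.0], [100.0, 104.0]))
```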


5. SUMMARY 


Both the theoretical as well as the empirical observations suggest that the yearly index 
is too volatile in cities where sample sizes are not large enough. Smoothing, at least of the 
type described, has proven fruitless. For this reason the annual index should be reserved only 
for those rare cases where sample sizes permit. On the other hand, the annual index could 
be used in conjunction with one of the more stable four-month indices to produce a compos- 
ite estimate analogous to that proposed by Kosary et al. (1982). However, empirical obser- 
vations would be needed to determine the appropriate weights to be used in averaging the 
two indices. 

By contrast, the carried, and to some degree, the interpolated index tend to be too smooth. 
That is, they tend to smooth out all peaks in addition to demonstrating a one or two (index) 
point lag. While the incremental and geometric indices are not entirely free of these lags, 
they tend to track the true index a little more closely. The incremental index performs the 
best overall. However, because of the mathematical ‘‘cleanliness’’ of the geometric index (i.e. 
its theoretical independence of its history and its correspondence to the chaining structure), 
it is the latter that is recommended here. In other words, the geometric index does not retain 
terms that could cause biases in the long run. 
It is also apparent that whenever possible, prior knowledge can be used to improve the 
index. Empirical adjustments as described in Section 3.1 can be useful, provided that they 
are well founded. If their use is contemplated, it is imperative that the empirical knowledge 
that leads to their application be monitored and its continued existence verified. 
Regression Analysis Using Survey Data 
with Endogenous Design 


ARIE TEN CATE¹ 


ABSTRACT 


This paper discusses the influence of the sampling design on the estimation of a linear regression model. 
Particularly, sampling designs will be discussed which are dependent on the values of the endogenous 
variable in the population: endogenous (or ‘‘informative’’) designs. A consistent estimator of the regression 
coefficients is given. Its variance is the sum of a sampling design component and a disturbance 
term component. Also, model-free regression is briefly discussed. The model-free regression estimator 
is the same as the model estimator in the case of an endogenous design. 


KEY WORDS: Regression; Survey sampling; Endogenous design. 


1. INTRODUCTION 


The heart of any statistical model is the assumption that the value of one or more variables 
is generated by drawing from some probability distribution; for example, a regression model 
with normally distributed disturbances. In this paper a finite set of elements which behave 
according to such a model will be considered. This set is called the population. Next, a sample 
is drawn from this population, without replacement. The subject of this paper is the influence 
of the sampling design on the estimation of the parameters of the model. This influence 
depends mainly on whether the design is exogenous or endogenous with respect to the model. 
In the case of an endogenous (or "informative") design, the sampling probabilities depend 
on the value of the endogenous ("dependent") variables. Then, the design should not be 
ignored in the estimation of the model parameters. The nature of the problem is indicated 
in Figure 1, where a stratified sampling design is shown. There are 3 strata, defined on the 
endogenous variable of a regression model. The middle stratum has a higher sampling fraction 
than the other two. The diagram shows that the slope of the regression line estimated using 
the sampled data points only, is biased downwards if one ignores the design. This bias does 
not vanish in large samples. This can be seen in an intuitive manner by imagining that every 
white and black dot in Figure 1 denotes a large number of identical data points. Even if 
this large number tends to infinity, the slope of the estimated regression line will be biased 
downwards, because the shape of the scatter will remain the same. 
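
The bias described above can be illustrated with a small simulation. The sketch below is not taken from the paper: the stratum boundaries (-1 and 1), the inclusion probabilities (0.9 and 0.1), the model parameters and the use of independent (Poisson) sampling are all assumptions chosen for illustration. It compares the unweighted slope with the slope obtained when each sampled point is weighted by the inverse of its inclusion probability.

```python
import random

def wls_line(xs, ys, ws=None):
    """Weighted least squares fit of y = a + b*x; ws=None means unit weights."""
    if ws is None:
        ws = [1.0] * len(xs)
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    return my - slope * mx, slope

def inclusion_prob(y, lower=-1.0, upper=1.0):
    """Hypothetical three-stratum design: the middle stratum is oversampled."""
    return 0.9 if lower <= y < upper else 0.1

def simulate(n=100_000, beta0=0.0, beta1=1.0, sigma=1.0, seed=1):
    """Return (unweighted slope, inverse-probability-weighted slope)."""
    rng = random.Random(seed)
    xs, ys, ws = [], [], []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        y = beta0 + beta1 * x + rng.gauss(0.0, sigma)
        p = inclusion_prob(y)          # the design depends on y: endogenous
        if rng.random() < p:           # assumed Poisson sampling
            xs.append(x)
            ys.append(y)
            ws.append(1.0 / p)
    return wls_line(xs, ys)[1], wls_line(xs, ys, ws)[1]

unweighted_slope, weighted_slope = simulate()
print(unweighted_slope, weighted_slope)
```

Under these assumptions the unweighted slope should fall well below the true value of 1, while the weighted slope should stay close to it, in line with the argument made for Figure 1.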

There is a rapidly growing body of literature on the application of regression techniques 
in finite population sampling. This literature deals with a variety of problems. One problem 
is how to use regression techniques in order to estimate a finite population total. Another 
problem concerns the estimation of population parameters such as $\sum xy / \sum x^2$, where the 
summation runs over all elements of the finite population. Reviews of the literature about 
these problems are given by Nathan (1981) and Smith (1981). A third problem is the estimation 
of the parameters of a regression model, using a sample from a finite population. This problem 
can be solved relatively easily in the case of an exogenous design. See Porter (1973, Section 
1.2), DuMouchel and Duncan (1983), and textbooks such as Cramer (1971, p. 143). Texts 
Figure 1. The Effect of Endogenous Stratification on the Estimated Regression Line 


such as Kmenta (1971, Section 8.3) and Johnston (1972, Section 9.2) discuss the closely related 
topic of stochastic regressors. See also White (1980a) for non-linear regression. Our topic, 
regression analysis with endogenous design, is more complicated. Hausman and Wise (1981) 
discuss stratified endogenous designs in a very simple case: two strata and a regression model 
consisting of a constant term only. Jewell (1985) gives some iterative estimators for the case 
of endogenous stratification. 

Regression analysis with endogenous design is related to the problem of endogenous non- 
response in regression analysis (see Heckman (1979)). However, we have a lesser problem 
here, since the probabilities involved in the sampling process are assumed to be known: they 
constitute the chosen design. On the other hand, as we shall see in Subsection 6.1, variance 
estimation with an endogenous design is in general rather difficult. 

Regression analysis with endogenous design may be compared with logit analysis with 
endogenous design, also called logit analysis with choice based sampling or case-control 
sampling. See Manski and McFadden (1981, Chapters 1 and 2) and Breslow and Day (1980, 
Section 6.3). 

The contents of the rest of the paper are as follows. In Sections 2 and 3 the main theorems 
are given. These theorems give a consistent estimator of the parameters of a linear regression 
model, using a sample with an endogenous design. Consistency is defined here in a similar 
way as in the discussion of the bias in the example above, though slightly more subtle: the 
x-values are replicated a large number of times and the y-values behave according to the 
regression model. In Sections 4 and 5 the variance of the estimator of the regression coeffi- 
cients is studied. Section 6 discusses the estimation of this variance. Section 7 deals with model- 
free regression, Section 8 discusses the various motives for weighted regression and finally, 
Section 9 concludes the paper. 
2. THE MODEL, THE SAMPLE AND A 
REGRESSION ESTIMATOR 


In this section the asymptotic properties of an estimator of a regression model are studied 
within the framework of finite population sampling without replacement. Asymptotic theory 
for samples drawn without replacement from a finite population may seem a contradiction 
since such a sample must be bounded. This contradiction is solved by increasing both the 
population size and the sample size, without bound, at the same rate. The dependence between 
the inclusions of population elements in the sample constitutes another problem, especially 
in the case of complex sampling designs. Here we use an idea of Brewer (1979). In Brewer’s 
system, limit theorems on sequences of independent variables can be used, while the results 
may still be applied to complex designs. Basically, this system consists of the replica idea 
already introduced informally above. This replica idea will be used extensively throughout 
the rest of this paper. For another approach, see Robinson (1982). 

First, the structure of the population and the model are given. Consider a finite set of 
$N_0$ elements. Each element has $r$ real-valued exogenous non-stochastic characteristics, 
together forming an $(N_0 \times r)$-matrix $X_0$. One of the fundamental assumptions of this paper 
is the following. The population consists of $K$ replicas of this set of $N_0$ elements, having 
$N = KN_0$ elements. Its matrix of exogenous variables is $X$, with 


$$X = \iota_K \otimes X_0. \qquad (1)$$


Here, $\iota_K$ is the $K$-vector with all elements equal to unity and $\otimes$ denotes the Kronecker 
matrix product. Asymptotic results will be derived by allowing $K$ to tend to infinity. 

The model assumptions describe the standard linear model. Each of the $N$ elements of 
the population has a score on a stochastic, endogenous variable. Together they form an 
$N$-vector $y$. It is assumed that 


$$E_\xi(y) = X\beta \qquad (2)$$


for some fixed, unknown $r$-vector $\beta$. $E_\xi$ denotes the expectation over all $y \in R^N$. Next we 
define 


$$\varepsilon = y - X\beta. \qquad (3)$$


It is assumed that the $N$ elements of $\varepsilon$ are i.i.d. It follows from (2) that all elements of 
$\varepsilon$ have expectation zero. Their variance is $\sigma^2$, that is, 


$$E_\xi(\varepsilon\varepsilon') = \sigma^2 I_N. \qquad (4)$$


Sampling is done without replacement here, as is common practice. The sample is described 
by a diagonal $(N \times N)$-matrix $T$, such that 


$$t_{ii} = \begin{cases} 1 & \text{if population element } i \text{ is in the sample} \\ 0 & \text{otherwise} \end{cases}$$


for all $i = 1, ..., N$. Obviously, $T$ is idempotent. The sample space $S$ is the set of all such 
matrices $T$. This set is finite. The sampling design is some probability distribution over the 
elements of the sample space $S$. The sampling design is endogenous here, meaning that it 
depends on $y$. Hence, the sampling design itself is stochastic. (A design which does not depend 
on $y$ is called exogenous, or uninformative.) Let $T$ be partitioned in a square $K \times K$ 
array of $(N_0 \times N_0)$ blocks. Let $T_k$ be the $k$-th diagonal block, related to the $k$-th replica. 
Similarly, let $y$ be partitioned in $K$ $N_0$-vectors, such that $y' = (y_1', y_2', ..., y_K')$. 
It is assumed that the sampling design depends on $y$ in the following sense: the $K$ pairs 
$(T_1, y_1), ..., (T_K, y_K)$ are i.i.d. 

The expectation over all elements of $S$, conditional on $y$ (or $\varepsilon$), plays an important role 
in this paper. It is denoted by $E_p$. Then we define 


$$\Pi = E_p(T). \qquad (5)$$


It is assumed that $\Pi$ is known. The diagonal elements of $\Pi$ are called inclusion probabilities: 
the probabilities that the population elements are included in the sample. The matrix $\Pi$ is 
partitioned in a square $K \times K$ array of $(N_0 \times N_0)$ blocks. Let $\Pi_k$ be the $k$-th diagonal 
block, related to the $k$-th replica. Note that each $\Pi_k$ is stochastic because it depends on $y_k$. 
By the above assumption, the $\Pi_1, ..., \Pi_K$ are i.i.d. The dependence of the $\Pi_k$ on $y$ is denoted 
by a function $F$, such that 


$$\Pi_k = F(y_k) \qquad (6)$$


for all $k = 1, ..., K$. It is assumed that $F(y_k)$ is non-singular for every $y_k$. In other words, 
the inclusion probabilities are always positive. 

This framework and Brewer's (1979) differ somewhat. Brewer has no endogenous 
variables and therefore all his $\Pi_k$ are nonstochastic and equal. One may also compare this 
approach with the idea of "constant in repeated samples" in the econometric literature; see 
e.g. Theil (1971, p. 364). 

The stage is now set for the estimation of $\beta$. The stochastic properties of estimators will 
be considered over all pairs $(y, T) \in (R^N \times S)$. The corresponding expectation will be 
denoted by $E_\xi E_p$. We shall consider a generalized least squares estimator of $\beta$, say $\hat\beta$, with 
weights equal to the inverse square roots of the inclusion probabilities, as follows, 


$$\hat\beta = [(\Pi^{-1/2}TX)'(\Pi^{-1/2}TX)]^{-1}(\Pi^{-1/2}TX)'(\Pi^{-1/2}Ty) = (X'\Pi^{-1}TX)^{-1}X'\Pi^{-1}Ty. \qquad (7)$$


Recall that the matrix $\Pi$ is known. Note that $X$ and $y$ relate to the population, but $T$ effectuates 
summation over the sampled elements. As an alternative to considering $\hat\beta$ as a 
generalized least squares estimator, assume that all elements of $\Pi^{-1}$ are integer numbers. 
Then, if each observation $i$ in the sample is copied $\pi_i^{-1}$ times, $\hat\beta$ is the ordinary least squares 
estimator applied to this inflated sample. In this view, no square roots of the probabilities 
are involved. See also Hausman and Wise (1981, p. 373). The main theorem of this paper is: 


Theorem 1. Under the assumptions made above ((1), (2) and the distribution of $\varepsilon$ and $T$), 
the generalized least squares estimator $\hat\beta$, defined in equation (7), is consistent for $K \to \infty$. 


The rest of the section is devoted to the proof of this theorem. The following lemma will 
be used in this proof and the proof of subsequent theorems. 
Lemma 1. Consider an $N$-vector $z$, such that $z = \iota_K \otimes z_0$, where $z_0$ is some fixed $N_0$-vector. 
Consider also an $N$-vector $\eta$, partitioned such that $\eta' = (\eta_1', \eta_2', ..., \eta_K')$. Each $\eta_k$ has $N_0$ 
elements. Assume that each $\eta_k$ is a function of $X_0$, $\beta$ and $\varepsilon_k$, all functions being the same. 
Then 


$$\operatorname*{plim}_{K\to\infty} \left(\frac{1}{K} z'\Pi^{-1}T\eta\right) = z_0'E_\xi(\eta_1), \qquad (8)$$


where $E_\xi(\eta_1)$ is the expectation of any $\eta_k$, being equal for all $k$. 

Proof of lemma 1: Consider the expectation of $\Pi_k^{-1}T_k\eta_k$: 


$$E_\xi E_p(\Pi_k^{-1}T_k\eta_k) = E_\xi[\Pi_k^{-1}E_p(T_k)\eta_k] = E_\xi(\eta_k), \qquad (9)$$


for all $k$. Since the distribution of $\eta_k$ is the same for each $k$, one may write 


$$E_\xi E_p(\Pi_k^{-1}T_k\eta_k) = E_\xi(\eta_1) \qquad (10)$$


for all $k$. Also, the $K$ terms $z_0'\Pi_k^{-1}T_k\eta_k$ are i.i.d. Thus, Khintchine's theorem applies as 
follows, 


$$\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}z'\Pi^{-1}T\eta\right) = \operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}\sum_k z_0'\Pi_k^{-1}T_k\eta_k\right) = E_\xi E_p(z_0'\Pi_k^{-1}T_k\eta_k) = z_0'E_\xi E_p(\Pi_k^{-1}T_k\eta_k). \qquad (11)$$


Substitution of (10) in (11) gives the lemma. The proof of theorem 1 is now straightforward. 
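Proof of theorem 1: The generalized least squares estimator of the theorem can be written as 


$$\hat\beta = (X'\Pi^{-1}TX)^{-1}X'\Pi^{-1}Ty = \beta + (X'\Pi^{-1}TX)^{-1}X'\Pi^{-1}T\varepsilon. \qquad (12)$$


Thus, 


$$\operatorname*{plim}_{K\to\infty}\hat\beta = \beta + \left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}X'\Pi^{-1}TX\right)\right]^{-1}\left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}X'\Pi^{-1}T\varepsilon\right)\right] = \beta + (X_0'X_0)^{-1}X_0'0 = \beta. \qquad (13)$$


The expression $X_0'X_0$ is formed by repeated application of lemma 1, substituting the columns 
of $X$ for both $z$ and $\eta$. Notice that $E_\xi(X_0) = X_0$ since $X_0$ is a constant. The expression 
$X_0'0$ is formed by repeated application of lemma 1, substituting the columns of $X$ for 
$z$ and $\varepsilon$ for $\eta$. 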
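
The estimator in (7) and its interpretation as ordinary least squares on an inflated sample can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names `solve` and `weighted_beta` and the four data points are made up for illustration, and only the sampled rows are passed in (which is what multiplication by $T$ achieves).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def weighted_beta(rows, ys, pis):
    """(X' Pi^-1 T X)^-1 X' Pi^-1 T y, summing over the sampled rows only."""
    r = len(rows[0])
    xtx = [[sum(x[j] * x[k] / p for x, p in zip(rows, pis)) for k in range(r)]
           for j in range(r)]
    xty = [sum(x[j] * y / p for x, y, p in zip(rows, ys, pis)) for j in range(r)]
    return solve(xtx, xty)

# Hypothetical sample: a constant term and one regressor, with inclusion
# probabilities whose inverses (2 and 4) are integers.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 2.5, 2.0, 5.0]
pis = [0.5, 0.25, 0.5, 0.25]
beta_weighted = weighted_beta(rows, ys, pis)

# Inflated sample: copy observation i exactly 1/pi_i times, then run plain OLS.
big_rows, big_ys = [], []
for x, y, p in zip(rows, ys, pis):
    copies = round(1.0 / p)
    big_rows += [x] * copies
    big_ys += [y] * copies
beta_inflated = weighted_beta(big_rows, big_ys, [1.0] * len(big_ys))
```

The two coefficient vectors agree up to rounding error, because copying observation $i$ $\pi_i^{-1}$ times adds the same terms to the normal equations as weighting it by $\pi_i^{-1}$.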




3. THE ESTIMATION OF THE DISTURBANCE VARIANCE 


The regression model described in Section 2 has two parameters: $\beta$ and $\sigma^2$. Theorem 1 
considered estimation of $\beta$; in this section the estimation of $\sigma^2$ will be considered. The result 
of this section is given in the following theorem. 

Theorem 2. The disturbance variance $\sigma^2$ is estimated consistently by the weighted sample 
variance of the residuals of $y$ if these weights are equal to the inverse of the square root 
of the inclusion probabilities. 

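Proof: The variance estimator of the theorem is 


$$\hat\sigma^2 = (\iota_N'\Pi^{-1}T\iota_N)^{-1}\hat\varepsilon'\hat\varepsilon \qquad (14)$$

with 

$$\hat\varepsilon = \Pi^{-1/2}T(y - X\hat\beta). \qquad (15)$$

Let 

$$\tilde y = \Pi^{-1/2}Ty, \qquad (16)$$

$$\tilde X = \Pi^{-1/2}TX, \qquad (17)$$

and 

$$\tilde\varepsilon = \Pi^{-1/2}T\varepsilon. \qquad (18)$$

Then 

$$\hat\varepsilon = \tilde y - \tilde X\hat\beta = \tilde y - \tilde X(\tilde X'\tilde X)^{-1}\tilde X'\tilde y \qquad (19)$$

and 

$$\hat\varepsilon'\hat\varepsilon = \tilde y'[I_N - \tilde X(\tilde X'\tilde X)^{-1}\tilde X']\tilde y = (\tilde X\beta + \tilde\varepsilon)'[I_N - \tilde X(\tilde X'\tilde X)^{-1}\tilde X'](\tilde X\beta + \tilde\varepsilon) = \tilde\varepsilon'\tilde\varepsilon - \tilde\varepsilon'\tilde X(\tilde X'\tilde X)^{-1}\tilde X'\tilde\varepsilon. \qquad (20)$$


The first term in the right-hand side (RHS) of (20) converges in probability as follows 


$$\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}\tilde\varepsilon'\tilde\varepsilon\right) = \operatorname*{plim}_{K\to\infty}\left[\frac{1}{K}\iota_N'\Pi^{-1}T\,\mathrm{diag}(\varepsilon)\varepsilon\right] = \iota_{N_0}'(\sigma^2\iota_{N_0}) = N_0\sigma^2. \qquad (21)$$


Here, diag($\varepsilon$) indicates the diagonal matrix with $\varepsilon$ as the diagonal. Lemma 1 has been applied 
with $\iota_N$ substituted for $z$ and diag($\varepsilon$)$\varepsilon$ for $\eta$, using model equation (4). Next, consider the 
second term in the RHS of (20). 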
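

$$\begin{aligned}
\operatorname*{plim}_{K\to\infty}\left[\frac{1}{K}\tilde\varepsilon'\tilde X(\tilde X'\tilde X)^{-1}\tilde X'\tilde\varepsilon\right]
&= \left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}\tilde\varepsilon'\tilde X\right)\right]\left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}\tilde X'\tilde X\right)\right]^{-1}\left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}\tilde X'\tilde\varepsilon\right)\right] \\
&= \left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}X'\Pi^{-1}T\varepsilon\right)\right]'\left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}X'\Pi^{-1}TX\right)\right]^{-1}\left[\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}X'\Pi^{-1}T\varepsilon\right)\right] \\
&= 0'(X_0'X_0)^{-1}0 = 0. \qquad (22)
\end{aligned}$$
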

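In the derivation of (22), use has been made of lemma 1 in the same manner as in the 
derivation of (13). The combination of (20), (21) and (22) gives 


$$\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}\hat\varepsilon'\hat\varepsilon\right) = N_0\sigma^2. \qquad (23)$$


Finally, lemma 1 is applied to the first factor in (14), with $\iota_N$ substituted both for $z$ and 
$\eta$. This gives 


$$\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}\iota_N'\Pi^{-1}T\iota_N\right) = N_0. \qquad (24)$$


With (23) and (24) we have 


$$\operatorname*{plim}_{K\to\infty}(\hat\sigma^2) = \sigma^2, \qquad (25)$$


which proves the theorem. Finally it may be useful to note, as a corollary of (23), that 


$$\frac{1}{N}\hat\varepsilon'\hat\varepsilon \qquad (26)$$


is also a consistent estimator of $\sigma^2$. 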
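
The two variance estimators (14) and (26) reduce to simple sums over the sampled elements, since $\hat\varepsilon'\hat\varepsilon$ is the sum of squared residuals divided by the inclusion probabilities. The sketch below only illustrates that arithmetic; the function names and the three residuals are hypothetical.

```python
def sigma2_hat(residuals, pis):
    """Estimator (14): weighted residual sum of squares over the sum of weights.

    residuals are y_i - x_i' beta_hat for the sampled elements only;
    pis are the corresponding inclusion probabilities.
    """
    weighted_ss = sum(e * e / p for e, p in zip(residuals, pis))
    estimated_population_size = sum(1.0 / p for p in pis)
    return weighted_ss / estimated_population_size

def sigma2_known_n(residuals, pis, population_size):
    """Estimator (26): same numerator, divided by the known population size N."""
    return sum(e * e / p for e, p in zip(residuals, pis)) / population_size

# Hypothetical sampled residuals and inclusion probabilities.
residuals = [1.0, -1.0, 2.0]
pis = [0.5, 0.5, 0.25]
estimate_14 = sigma2_hat(residuals, pis)
estimate_26 = sigma2_known_n(residuals, pis, population_size=8)
```

In this toy example the sum of inverse inclusion probabilities happens to equal the assumed population size of 8, so the two estimates coincide; in general they differ in finite samples and agree only in the limit.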


4. THE VARIANCE OF $\hat\beta$ 


In this section the asymptotic variance of the estimator is given. 

Theorem 3. The asymptotic variance of $\hat\beta$ is given by 


$$\mathrm{Var}(\hat\beta) = (X'X)^{-1}X'VX(X'X)^{-1} \qquad (27)$$

with 

$$V = E_\xi[\mathrm{diag}(\varepsilon)\Pi^{-1}P\Pi^{-1}\mathrm{diag}(\varepsilon)], \qquad (28)$$

and 

$$P = E_p(T\iota\iota'T). \qquad (29)$$


The elements of $P$ are the so-called second order inclusion probabilities: the probability 
for any pair of elements of the population of being included in the sample. The diagonal 
of $P$ is equal to the diagonal of $\Pi$. The rest of this section is devoted to a proof of this 
theorem. 


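Proof: Consider the asymptotic distribution for $K \to \infty$ of 


$$K^{1/2}(\hat\beta - \beta) = K^{1/2}(X'\Pi^{-1}TX)^{-1}X'\Pi^{-1}T\varepsilon = \left(\frac{1}{K}X'\Pi^{-1}TX\right)^{-1}K^{-1/2}X'\Pi^{-1}T\varepsilon. \qquad (30)$$


Since 


$$\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}X'\Pi^{-1}TX\right) = X_0'X_0, \qquad (31)$$


the asymptotic distribution of $K^{1/2}(\hat\beta - \beta)$ is equal to the asymptotic distribution of $\delta$, 
with 


$$\delta = K^{-1/2}(X_0'X_0)^{-1}X'\Pi^{-1}T\varepsilon = K^{-1/2}(X_0'X_0)^{-1}\sum_k X_0'\Pi_k^{-1}T_k\varepsilon_k = K^{-1/2}\sum_k \delta_k \qquad (32)$$


and 


$$\delta_k = (X_0'X_0)^{-1}X_0'\Pi_k^{-1}T_k\varepsilon_k, \qquad (33)$$


for all $k = 1, ..., K$. (See e.g. Rao (1973), p. 122). Since the vectors $\delta_k$ ($k = 1, ..., K$) are 
i.i.d. and also 


$$\begin{aligned}
E_\xi E_p(\delta_k) &= (X_0'X_0)^{-1}X_0'E_\xi E_p(\Pi_k^{-1}T_k\varepsilon_k) \\
&= (X_0'X_0)^{-1}X_0'E_\xi[\Pi_k^{-1}E_p(T_k)\varepsilon_k] \\
&= (X_0'X_0)^{-1}X_0'E_\xi(\varepsilon_k) = 0, \qquad (34)
\end{aligned}$$


the variance of $\delta$, say Var($\delta$), is equal for all $K$ and also equal to the variance of the asymptotic 
distribution of $\delta$ for $K \to \infty$. This variance can be written as 


$$\mathrm{Var}(\delta) = E_\xi E_p(\delta_k\delta_k') \qquad (35)$$


for any $k \in \{1, ..., K\}$. Since the vectors $\delta_k$ are i.i.d. this may be rewritten as 


$$\begin{aligned}
\mathrm{Var}(\delta) &= \frac{1}{K}\sum_k E_\xi E_p(\delta_k\delta_k') \\
&= \frac{1}{K}(X_0'X_0)^{-1}\left[E_\xi E_p\left(\sum_k X_0'\Pi_k^{-1}T_k\varepsilon_k\varepsilon_k'T_k\Pi_k^{-1}X_0\right)\right](X_0'X_0)^{-1} \\
&= K(X'X)^{-1}[E_\xi E_p(X'\Pi^{-1}T\varepsilon\varepsilon'T\Pi^{-1}X)](X'X)^{-1} \\
&= K(X'X)^{-1}X'\{E_\xi E_p[\mathrm{diag}(\varepsilon)\Pi^{-1}T\iota\iota'T\Pi^{-1}\mathrm{diag}(\varepsilon)]\}X(X'X)^{-1} \\
&= K(X'X)^{-1}X'\{E_\xi[\mathrm{diag}(\varepsilon)\Pi^{-1}E_p(T\iota\iota'T)\Pi^{-1}\mathrm{diag}(\varepsilon)]\}X(X'X)^{-1}. \qquad (36)
\end{aligned}$$


Division of Var($\delta$) by $K$ gives Var($\hat\beta$) and completes the proof. 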


5. A DECOMPOSITION OF VAR($\hat\beta$) 

The variance formula (27) can be rewritten as 

$$\mathrm{Var}(\hat\beta) = \sigma^2(X'X)^{-1} + (X'X)^{-1}X'V^*X(X'X)^{-1} \qquad (37)$$

with 

$$V^* = E_\xi[\mathrm{diag}(\varepsilon)(\Pi^{-1}P\Pi^{-1} - \iota\iota')\mathrm{diag}(\varepsilon)], \qquad (38)$$


using (4). The first term in the RHS of (37) might reasonably be called the $\xi$-component 
of the variance of $\hat\beta$. This component would contain all the variance of $\hat\beta$ if the whole population 
was sampled. It is entirely due to the variation in the disturbance $\varepsilon$ and it is the familiar 
expression for that case. The second term in the RHS of (37) might be called the $p$-component 
of the variance of $\hat\beta$. This component contains the matrices $\Pi$ and $P$, which describe the 
sampling design. This component looks like the variance formula of the estimator of a total 
or average of a finite population. The theory of such estimators will be discussed briefly 
in the rest of this section, as an aid in the interpretation of the $p$-component of Var($\hat\beta$). 

Consider a finite population of $N$ elements. (No replica structure is assumed here). Each 
element of this population has a score on some real non-stochastic variable, collected in an 
$N$-vector $x$. From this population a sample without replacement is taken. The sample is described 
by the diagonal matrix $T$, as before. Also as before, 


$$\Pi = E_p(T) \qquad (39)$$

and 

$$P = E_p(T\iota\iota'T), \qquad (40)$$


the first order and second order inclusion probabilities, respectively. There is no regression 
model here, so $\Pi$ and $P$ are fixed known matrices. Horvitz and Thompson (1952) suggested 
estimating the population total $x'\iota$ by 


$$\hat X = x'\Pi^{-1}T\iota. \qquad (41)$$
Obviously this is an unbiased estimator, in view of (39). The variance of $\hat X$ is 


$$\mathrm{Var}(\hat X) = E_p(\hat X^2) - (x'\iota)^2 = E_p(x'\Pi^{-1}T\iota\iota'T\Pi^{-1}x) - x'\iota\iota'x = x'(\Pi^{-1}P\Pi^{-1} - \iota\iota')x. \qquad (42)$$


The last member of equation (42) is the variance formula of the Horvitz-Thompson 
estimator, which can be found in textbooks on sampling, such as Cochran (1977), though 
usually not in matrix format. The expression in parentheses in the last member of (42) is 
equal to the expression in parentheses in (38), the definition of $V^*$. The latter is contained 
in the formula of the $p$-component of Var($\hat\beta$). Thus, the diagonal elements of the $p$-component 
of the variance matrix Var($\hat\beta$) can be considered as the $\xi$-expectation of the $p$-variance of 
the Horvitz-Thompson estimator of the row totals of $(X'X)^{-1}X'\mathrm{diag}(\varepsilon)$. These totals 
are the elements of the vector $(X'X)^{-1}X'\varepsilon$. 
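
The matrix form of the Horvitz-Thompson variance in (42) can be verified by brute force on a tiny population. The sketch below assumes simple random sampling without replacement (every subset of size n equally likely), a design chosen here only because its sample space is easy to enumerate; the function name and the data are illustrative.

```python
from itertools import combinations

def horvitz_thompson_check(x, n):
    """Compare the enumerated variance of the HT estimator with formula (42)."""
    size = len(x)
    samples = list(combinations(range(size), n))   # the sample space S
    prob = 1.0 / len(samples)                      # equal-probability design
    # First and second order inclusion probabilities, by enumeration.
    pi = [sum(prob for s in samples if i in s) for i in range(size)]
    P = [[sum(prob for s in samples if i in s and j in s) for j in range(size)]
         for i in range(size)]
    # HT estimate of the total for every possible sample.
    estimates = [sum(x[i] / pi[i] for i in s) for s in samples]
    mean = sum(prob * e for e in estimates)
    var_enumerated = sum(prob * (e - mean) ** 2 for e in estimates)
    # x'(Pi^-1 P Pi^-1 - iota iota')x written out element by element.
    var_formula = sum(x[i] * (P[i][j] / (pi[i] * pi[j]) - 1.0) * x[j]
                      for i in range(size) for j in range(size))
    return mean, var_enumerated, var_formula

x = [1.0, 2.0, 3.0, 10.0]
ht_mean, ht_var_enumerated, ht_var_formula = horvitz_thompson_check(x, n=2)
```

The mean of the estimates equals the population total (unbiasedness), and the enumerated variance matches the quadratic form in (42) up to rounding error.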


6. THE ESTIMATION OF VAR($\hat\beta$) 


6.1 The General Case 


In this section the estimation of the asymptotic variance Var($\hat\beta$) is considered. Consistent 
estimation of Var($\hat\beta$) is rather difficult, since this requires knowledge of the relationship $F$ 
between $y$ and the sampling design, as it appears in the matrix $V$. In practice, only the sampling 
design for the actual values of $y$ will be known. In general, it is difficult to tell from this 
design only, what the design would be like if $y$ took on different values. In a sense not only 
a regression model is involved, but also a model of the designer himself! 

For the moment we assume that the function $F$ is known, and therefore $V$ is a known 
function of $X$ and the parameters of the model. (See Subsection 6.2 for a special case). This 
is expressed as follows. 


$$V = V(\beta, \sigma^2; X). \qquad (43)$$


It is assumed that $V(\beta, \sigma^2; X)$ is a continuous function. For the sake of brevity, $\hat V$ is defined 
as 


$$\hat V = V(\hat\beta, \hat\sigma^2; X), \qquad (44)$$


where $\hat\beta$ and $\hat\sigma^2$ are consistent estimators of $\beta$ and $\sigma^2$ respectively. The rest of this subsection 
gives a theorem on consistent variance estimation, and its proof. Consistent estimation 
of Var($\hat\beta$) by var($\hat\beta$) is interpreted here as follows: 


$$\operatorname*{plim}_{K\to\infty} K\,\mathrm{var}(\hat\beta) = \lim_{K\to\infty} K\,\mathrm{Var}(\hat\beta). \qquad (45)$$


Theorem 4. Under the assumptions made above, the asymptotic variance Var($\hat\beta$) is estimated 
consistently by 


$$\mathrm{var}(\hat\beta) = (X'\Pi^{-1}TX)^{-1}X'T\left(\frac{\hat V}{P}\right)TX(X'\Pi^{-1}TX)^{-1}, \qquad (46)$$

where $(\hat V/P)$ denotes the matrix consisting of the elements of $\hat V$ divided by the corresponding 
elements of $P$. 


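Proof: First the structure of $V$ will be considered. Let $V$ be partitioned in a square $K \times K$ 
array of $(N_0 \times N_0)$ blocks. The $(k, r)$-th off-diagonal block of $V$ is equal to 


$$\begin{aligned}
E_\xi[\mathrm{diag}(\varepsilon_k)\Pi_k^{-1}E_p(T_k\iota\iota'T_r)\Pi_r^{-1}\mathrm{diag}(\varepsilon_r)]
&= E_\xi[\mathrm{diag}(\varepsilon_k)\Pi_k^{-1}E_p(T_k)\iota\iota'E_p(T_r)\Pi_r^{-1}\mathrm{diag}(\varepsilon_r)] \\
&= E_\xi(\varepsilon_k\varepsilon_r') = 0, \qquad (47)
\end{aligned}$$


using the assumed replica structure of the population and the sampling design. The diagonal 
blocks of $V$ are identical and depend on $X_0$. Thus, $V(\beta, \sigma^2; X)$ can be written as 


$$V(\beta, \sigma^2; X) = I_K \otimes V_0(\beta, \sigma^2; X_0), \qquad (48)$$


where $V_0(\beta, \sigma^2; X_0)$ is an $N_0 \times N_0$ matrix function. Together with (1), equation (48) can 
be used to rewrite $K\,\mathrm{Var}(\hat\beta)$ as follows. 


$$K\,\mathrm{Var}(\hat\beta) = (X_0'X_0)^{-1}X_0'V_0X_0(X_0'X_0)^{-1}, \qquad (49)$$


where $V_0$ denotes $V_0(\beta, \sigma^2; X_0)$. The RHS of (49) is independent of $K$ and therefore equal 
to its limit as $K$ tends to infinity. Next, the LHS of (45) is considered. 


$$K\,\mathrm{var}(\hat\beta) = \left(\frac{1}{K}X'\Pi^{-1}TX\right)^{-1}\left[\frac{1}{K}X'T\left(\frac{\hat V}{P}\right)TX\right]\left(\frac{1}{K}X'\Pi^{-1}TX\right)^{-1}. \qquad (50)$$


Earlier, in the derivation of (13) and (22), use has already been made of 


$$\operatorname*{plim}_{K\to\infty}\left(\frac{1}{K}X'\Pi^{-1}TX\right) = X_0'X_0. \qquad (51)$$


It follows from the assumption that $V(\beta, \sigma^2; X)$ is a continuous function, that 


$$\operatorname*{plim}_{K\to\infty}(\hat V_0) = V_0, \qquad (52)$$


where $\hat V_0$ denotes $V_0(\hat\beta, \hat\sigma^2; X_0)$. Using (1), (48) and (52) gives 


$$\begin{aligned}
\operatorname*{plim}_{K\to\infty}\left[\frac{1}{K}X'T\left(\frac{\hat V}{P}\right)TX\right]
&= \operatorname*{plim}_{K\to\infty}\left[\frac{1}{K}\sum_k X_0'T_k\left(\frac{\hat V_0}{P_0}\right)T_kX_0\right] \\
&= \operatorname*{plim}_{K\to\infty}\left[\frac{1}{K}\sum_k X_0'T_k\left(\frac{V_0}{P_0}\right)T_kX_0\right] = X_0'V_0X_0. \qquad (53)
\end{aligned}$$

Here $P_0$ denotes $E_p(T_k\iota\iota'T_k)$, which is the same for all $k = 1, ..., K$. The last equality 
sign results from the application of Khintchine's theorem, since the terms in the second 
summation over $k$ in (53) are i.i.d. with $p$-expectation equal to $X_0'V_0X_0$. Finally, the combination 
of (50), (51) and (53) gives 


$$\operatorname*{plim}_{K\to\infty} K\,\mathrm{var}(\hat\beta) = (X_0'X_0)^{-1}X_0'V_0X_0(X_0'X_0)^{-1}, \qquad (54)$$

which is the same expression as the RHS of (49). 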


6.2 Stratified Sampling 


In this subsection the computation of the matrix $T(V/P)T$ is given for a special case: 
(1) the disturbances are normally distributed, and (2) the sampling design is an endogenously 
stratified sampling design, such that the inclusion probability $\pi_i$ of element $i$ of the population 
is a function $f$ of only the $i$-th element of $y$, say $y_i$. Thus, 


$$\pi_i = f(y_i) \qquad (55)$$


for $i = 1, ..., N$. As an example, consider the stratified sample which was shown in Figure 
1. The design contains three strata there. The elements in the middle stratum have the highest 
inclusion probability. Figure 2 shows the corresponding function $f$. 


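[Figure: $f(y)$ plotted against $y$, with the stratum boundaries $L_1$ and $L_2$ marked on the horizontal axis.] 


Figure 2. The Probability Function f Corresponding to Figure 1 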


In general, let there be $H$ strata, indicated by $h = 1, ..., H$. Let the boundaries of these 
strata be $L_0, L_1, ..., L_H$. Typically, $L_0 = -\infty$ and $L_H = +\infty$. Let $\pi_{(h)}$ be the inclusion 
probability of the population elements in stratum $h$. More formally, the function $f(\cdot)$ is 
such that $f(y)$ equals $\pi_{(h)}$ if $L_{h-1} \le y < L_h$. The values of $\pi_{(h)}$ and $L_h$ are usually known 
in practice, since the actual sampling design depends on their values. 
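In stratified sampling, the second order inclusion probability of any two population 
elements not in the same stratum equals the product of their respective first order inclusion 
probabilities: their inclusions in the sample are independent. For any two population elements 
in the same stratum this holds approximately. Thus, approximately the off-diagonal elements 
of $P$ are equal to the off-diagonal elements of $\Pi\iota\iota'\Pi$. The diagonal of $P$ is equal to the diagonal 
of $\Pi$, as before. Thus, approximately, 

$$P = \Pi\iota\iota'\Pi - \Pi^2 + \Pi. \qquad (56)$$

Then 

$$\begin{aligned}
V &= E_\xi[\mathrm{diag}(\varepsilon)(\iota\iota' - I + \Pi^{-1})\mathrm{diag}(\varepsilon)] \\
&= E_\xi[\varepsilon\varepsilon' - \mathrm{diag}^2(\varepsilon) + \mathrm{diag}^2(\varepsilon)\Pi^{-1}] = E_\xi[\mathrm{diag}^2(\varepsilon)\Pi^{-1}], \qquad (57)
\end{aligned}$$

in view of assumption (4). Thus $V$ is a diagonal matrix here. Then 


$$T\left(\frac{V}{P}\right)T = T\Pi^{-1}E_\xi[\mathrm{diag}^2(\varepsilon)\Pi^{-1}], \qquad (58)$$


which is also a diagonal matrix. Now consider a population element $i$, which is included in 
the sample. Then, using (58) and assuming normally distributed disturbances, 


$$\begin{aligned}
\left[T\left(\frac{\hat V}{P}\right)T\right]_{ii}
&= \frac{1}{\pi_i}\int_{-\infty}^{\infty}\frac{1}{f(x_i'\hat\beta + \varepsilon_i)}\,\phi(\varepsilon_i; \hat\sigma^2)\,\varepsilon_i^2\,d\varepsilon_i \\
&= \frac{1}{\pi_i}\sum_{h=1}^{H}\frac{1}{\pi_{(h)}}\int_{L_{h-1}-x_i'\hat\beta}^{L_h-x_i'\hat\beta}\phi(\varepsilon_i; \hat\sigma^2)\,\varepsilon_i^2\,d\varepsilon_i \\
&= \frac{\hat\sigma^2}{\pi_i}\left\{\frac{1}{\pi_{(H)}} + \sum_{h=1}^{H-1}\left(\frac{1}{\pi_{(h)}} - \frac{1}{\pi_{(h+1)}}\right)\Psi\left(\frac{L_h - x_i'\hat\beta}{\hat\sigma}\right)\right\}. \qquad (59)
\end{aligned}$$


Here, $x_i'$ denotes the $i$-th row of $X$ and $\phi(\cdot; \hat\sigma^2)$ indicates the normal density with mean zero and variance $\hat\sigma^2$. The function 
$\Psi(\cdot)$ is defined as 


$$\Psi(x) = \int_{-\infty}^{x}\phi(\varepsilon; 1)\,\varepsilon^2\,d\varepsilon = \Phi(x) - x\,\phi(x; 1), \qquad (60)$$


where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. 
In the derivation of (59), use has been made of $\Psi(L_0) = 0$ and $\Psi(L_H) = 1$. 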
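
The diagonal elements in (59) are straightforward to compute once $\Psi$ is available. The sketch below is an illustration under the assumptions of this subsection (normal disturbances, inclusion probabilities constant within strata defined on $y$); the function names, the stratum boundaries and the probabilities are hypothetical. It evaluates the stratum-by-stratum form of (59), that is, the middle expression with each integral written as $\hat\sigma^2[\Psi(\cdot) - \Psi(\cdot)]$, which equals the last expression after rearranging the sum.

```python
import math

def normal_pdf(x):
    """Standard normal density phi(x; 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def psi(x):
    """Psi(x) = Phi(x) - x*phi(x), the partial second moment in (60)."""
    if math.isinf(x):
        return 0.0 if x < 0 else 1.0
    return normal_cdf(x) - x * normal_pdf(x)

def diagonal_element(pi_i, fitted, sigma, bounds, stratum_pis):
    """Element i of T(V/P)T for a sampled unit.

    pi_i        inclusion probability of the sampled unit
    fitted      x_i' beta_hat
    sigma       estimated disturbance standard deviation
    bounds      [L_0, L_1, ..., L_H], typically starting at -inf, ending at +inf
    stratum_pis [pi_(1), ..., pi_(H)]
    """
    total = 0.0
    for h, p in enumerate(stratum_pis):
        lower = (bounds[h] - fitted) / sigma
        upper = (bounds[h + 1] - fitted) / sigma
        total += (psi(upper) - psi(lower)) / p
    return sigma ** 2 * total / pi_i

# Hypothetical three-stratum design in the spirit of Figure 1.
bounds = [-math.inf, -1.0, 1.0, math.inf]
element = diagonal_element(0.9, 0.3, 2.0, bounds, [0.1, 0.9, 0.1])
# With equal probabilities in all strata the element reduces to sigma^2 / (pi_i * p).
element_equal = diagonal_element(0.5, 0.3, 2.0, bounds, [0.5, 0.5, 0.5])
```

When all strata share the same inclusion probability the design is no longer informative and the element collapses to $\hat\sigma^2/(\pi_i \pi_{(h)})$, which gives a simple sanity check on the implementation.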


7. MODEL-FREE REGRESSION 


7.1 Consistent Estimation 


As a digression from the main theme of this paper, model-free regression will be con- 
sidered in this section. Firstly, model-free regression can be usefully applied in the case of 
doubt about the validity of a linear model. See Fuller (1975), who studies model-free regres- 
sion for some specific designs. Van Praag (1981, 1982) studies model-free regression in the 
case of repeated sampling from some probability distribution. See also DuMouchel and 
Duncan (1983). White (1980b, Section 3) studies related problems. Secondly, the so-called 
regression estimator of a population total uses model-free regression. See textbooks such 
as Cochran (1977), the review papers mentioned above by Nathan (1981) and Smith (1981) 
and Bethlehem and Keller (1983). 

The purpose of model-free regression is the estimation of the population parameter vector 


$$b = (X'X)^{-1}X'y \qquad (61)$$


without assumptions about the probability distribution of $y$. In fact, both $X$ and $y$ are considered 
non-stochastic. Further, the same replica structure as in Section 2 is used, as follows. 


$$X = \iota_K \otimes X_0, \qquad (62)$$

and 

$$y = \iota_K \otimes y_0, \qquad (63)$$


where $y_0$ is some fixed $N_0$-vector. As before, the $K$ diagonal matrices $T_k$ ($k = 1, ..., K$) 
are i.i.d. These matrices describe the sample as in Section 2. Together the matrices $T_k$ form 
the matrix $T$. No additional assumptions are made concerning the distribution of $T$. 

It is proved relatively easily, along the same lines as in Section 2, that the weighted estimator 
$\hat\beta$ defined before in (7), is a consistent estimator of $b$ defined in (61). See also Jonrup and 
Rennermalm (1976), who describe $\hat\beta$ as an "approximately unbiased" estimator of $b$, and 
Van Praag (1982, Section 4d), where "selectivity bias" with known inclusion probabilities 
is studied for the model-free case. 

It follows in the same manner as in Section 4 that in the model-free case the asymptotic 
variance of $\hat\beta$, say $\mathrm{Var}_{MF}(\hat\beta)$, equals 


$$\mathrm{Var}_{MF}(\hat\beta) = (X'X)^{-1}X'VX(X'X)^{-1}, \qquad (64)$$

with 

$$e = y - Xb, \qquad (65)$$

$$V = \mathrm{diag}(e)\Pi^{-1}P\Pi^{-1}\mathrm{diag}(e), \qquad (66)$$


and with $P$ defined as before in (29). Notice that $V$ in (66) differs from $V$ in (28) in the omission 
of the $\xi$-expectation and the substitution of $e$ for $\varepsilon$. 

It is interesting to rewrite $\mathrm{Var}_{MF}(\hat\beta)$ in the same way as Var($\hat\beta$) was rewritten in Section 
5. In doing so, use will be made of 


$$X'e = 0, \qquad (67)$$


which follows directly from (61) and (65). The $\mathrm{Var}_{MF}(\hat\beta)$ can be rewritten as 


$$\begin{aligned}
\mathrm{Var}_{MF}(\hat\beta) &= (X'X)^{-1}X'\mathrm{diag}(e)\Pi^{-1}P\Pi^{-1}\mathrm{diag}(e)X(X'X)^{-1} \\
&= (X'X)^{-1}X'\mathrm{diag}(e)\Pi^{-1}P\Pi^{-1}\mathrm{diag}(e)X(X'X)^{-1} - (X'X)^{-1}X'ee'X(X'X)^{-1} \\
&= (X'X)^{-1}X'\mathrm{diag}(e)(\Pi^{-1}P\Pi^{-1} - \iota\iota')\mathrm{diag}(e)X(X'X)^{-1}. \qquad (68)
\end{aligned}$$

The last member of (68) corresponds with the $p$-component of the decomposition of Var($\hat\beta$) 
in (37). It may be concluded from (68) that in model-free regression the variance of the 
estimator of the regression coefficients consists of the $p$-component, while the $\xi$-component 
vanishes. 

Notice finally that, using the discussion at the end of Section 5, the last member of (68) 
can be written as 


$$(X'X)^{-1}\,\Omega\,(X'X)^{-1}, \qquad (69)$$


where the matrix $\Omega$ is the $p$-variance-covariance of the row totals of $X'\mathrm{diag}(e)$. A similar 
result was reached by Binder (1983, Section 4), though along different lines. 
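
Relation (67), on which the rewriting in (68) rests, can be checked directly for any finite population. The following sketch uses a made-up population of five elements with a constant term and a single regressor; the function name is illustrative.

```python
def finite_population_fit(xs, ys):
    """Population parameter b of (61) for the model-free line y = b0 + b1*x,
    together with the population residuals e = y - Xb of (65)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    residuals = [y - intercept - slope * x for x, y in zip(xs, ys)]
    return (intercept, slope), residuals

# Hypothetical finite population; no model is assumed for y.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 0.5, 4.0, 3.5, 9.0]
b, e = finite_population_fit(xs, ys)
column_totals = (sum(e), sum(x * r for x, r in zip(xs, e)))   # the two rows of X'e
```

Both totals are zero up to rounding error, so the term $(X'X)^{-1}X'ee'X(X'X)^{-1}$ subtracted in (68) is itself zero and the subtraction leaves the variance unchanged.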


8. DISCUSSION 


In this section some practical considerations are given concerning the use of weights in 
regression analysis. Several motives for the use of weights are discussed briefly, related to 
the preceding technical sections of this paper. 

First of all, it must be noted that the difference between weighted and unweighted regressions 
may be of some significance. An important example is the case where business firms 
are the unit of study (farms, industrial enterprises or any other kind of business firm), 
varying considerably in the number of employees. At the Netherlands Central Bureau of 
Statistics, for instance, the classification by number of employees is a standard stratification 
variable in sampling designs of business firms, giving a considerable range of inclusion probabilities, 
with the large units chosen with relatively large probabilities. In studies with employment 
as the endogenous variable, such a sampling design is endogenous, which calls for weighted 
regression, the large units receiving small weights. 

Secondly, in the case of units varying widely in size, a major problem with regression 
analysis is the heteroscedasticity of the error term. This calls for weighted regression, of the 
same sort as the weighting due to an endogenous design discussed in Section 2: large units 
receiving small weights. 

Finally, there is a third motive for the weighting of sampled data: the notion of a 
model-free regression, as discussed in Section 7 above. Again, the weights here are of the same 
sort as the weights in Section 2. 

Summing up, there seems to be no reason not to incorporate the sampling design in 
regression analysis. 


9. CONCLUSIONS 


In this paper the estimation of a regression model with survey sample data has been studied. 
In particular, samples drawn with an endogenous design have been studied; for example, 
a sample stratified on the endogenous variable. It has been shown that for such a sample 
the weighting of the observations with the inverse of the square root of the sampling frac- 
tions gives a consistent estimator. The concept of consistency used here is a modification 
of Brewer (1979). The asymptotic variance of the estimator has been given, as well as a 
consistent estimator of this variance. The variance is the sum of a sampling component and 
a model component. 
Also, model-free regression has been considered. Model-free regression requires the same 
weighting as endogenous stratification. The variance of the estimator of the model-free regres- 
sion coefficients contains only the sampling component, and not the model component. 

Finally, some practical considerations relative to the weighting of the data have been given. 
A Cluster Analysis of Activities of Daily Living 
From the Canadian Health and Disability Survey¹ 


D.A. BINDER and G. LAZARUS² 


ABSTRACT 


The Canadian Health and Disability Survey, administered as a supplement to the Canadian Labour 
Force Survey in October 1983, collected data on potentially disabled persons by means of a screening 
questionnaire and a follow-up questionnaire for those screened-in. The data from the screening ques- 
tionnaire, consisting of a set of activities of daily living, were used to group respondents according 
to identifiable characteristics. A description of the groups of respondents is provided along with an 
evaluation of the methods used in their determination. An incompletely ordered severity scale is proposed. 


KEY WORDS: Disability scale; Discriminant analysis. 


1. INTRODUCTION 


Considerable efforts have been made to acquire a better understanding of the disabled 
population. These efforts have focussed on the development of a useful vehicle for capturing 
the potentially disabled population, as well as on the analysis of survey data in order 
to understand the various dimensions of disability and to develop useful 
measures of severity. Examples of papers which examine these issues are Dolson et al. (1984) 
and Raymond et al. (1981), among others. This paper chronicles the development of an exploratory 
technique in order to gain a better understanding of the disabled population in 
Canada. In particular, a cluster analysis based on results of several discriminant analyses 
was performed. 

The next section presents information about the Canadian Health and Disability Survey. 
The third section describes the development of the clusters. Section 4 focusses on the 
characterization of the clusters. Some analysis of the behaviour of the derived clusters is 
given in Section 5. The paper concludes with some closing remarks. 


2. BACKGROUND 


In response to a need for data on disabled persons in Canada, Statistics Canada under- 
took a program to create a disability database. The Canadian Health and Disability Surveys 
(CHDS) were administered as supplements to the Canadian Labour Force Survey (LFS) in 
October 1983 and June 1984. In both cases, separate questionnaires were administered to 
children and to adults. In the October survey, the adult questionnaire was administered to 
everyone in the LFS sample (the frame includes about 97% of the Canadian population ag- 
ed 15 or more). In June, the adult survey was restricted to those aged 15 to 64 from the 
six provinces with the smaller sample sizes in October (i.e. Newfoundland, Prince Edward 
Island, Nova Scotia, New Brunswick, Manitoba and Saskatchewan). Children from all pro- 
vinces were surveyed in both October and June. 


¹ This is a revised version of the paper presented at ASA meetings, Social Statistics Section, Las Vegas, August 1985.
² D.A. Binder and G. Lazarus, Social Survey Methods Division, Informatics and Methodology Branch, Statistics
Canada, 4th Floor, Jean Talon Building, Tunney’s Pasture, Ottawa, Ontario, Canada, K1A 0T6.
This paper concentrates on work which utilized only the data from the adult questionnaire
administered in October 1983. This survey obtained 92,945 adult respondents from approximately
47,000 households.


2.1 Questionnaire 


2.1.1 Screening Section 


The Labour Force Supplement included a screen which was used to identify respondents 
for a follow-up questionnaire. The screening section consisted of nineteen items - seventeen 
activities of daily living, an activity limitation item and an item about mental handicap. The 
activities of daily living (ADL’s) are a set of activities which any person would perform dur- 
ing the course of his/her regular living pattern. The set used here was a modified version 
of those developed by the Organisation for Economic Co-operation and Development (OECD)
and has been utilized by several other countries. 

The ADL’s are presented in Table 1 with the questionnaire identification and the orienta- 
tion of the specified activity. Two ADL’s are related to hearing troubles, two to vision troubles, 
four to mobility troubles, one to speaking and being understood and the remaining eight 
to agility troubles. 


Table 1
Activities of Daily Living


Questionnaire
ID              Description                                      Orientation

A10             Walking 400 metres                               Mobility
A11             Walking up and down stairs                       Mobility
A12             Carrying 5 kg object for 10 metres               Mobility
A13             Moving from one room to another                  Agility
A14             Standing for long periods                        Mobility
A15             When standing, bending down to pick up object    Agility
A16             Dressing and undressing                          Agility
A17             Getting in and out of bed                        Agility
A18             Cutting own toenails                             Agility
A19             Using fingers to grasp or handle                 Agility
A20             Reaching                                         Agility
A21             Cutting own food                                 Agility
A22             Reading newsprint                                Vision
A23             Seeing clearly a face across the room            Vision
A24             Hearing conversation with another person         Hearing
A25             Hearing conversation with two or more persons    Hearing
A26             Speaking and being understood                    Speaking and being understood

An example of the wording of these questions in the screening section of the questionnaire
is as follows: (A20) Does . . . have any trouble reaching? The activity limitation item
(A27) concerned limitation “in the kind or amount of activity he/she can do at home, at
work or going to school because of a long-term physical condition or health problem”. The
final item in the screening section (A28) concerned mental handicap.

It should be noted that the survey was concerned with long-term conditions or health pro- 
blems - those that had lasted or were expected to last more than six months (excluding pregnan- 
cy). An individual was screened in if he/she had trouble with at least one of the ADL’s, 
the activity limitation item or had a mental handicap. (Proxy responses were required for 
mentally handicapped individuals). 


2.1.2 Follow-up Section 


The follow-up section of the questionnaire was completed for individuals selected by the 
screening section. This section included an item which sought to determine if the respondent 
was completely unable to perform the ADL(’s) he/she had trouble with. Other segments of 
the follow-up questionnaire pertained to: nature of the disability (related to trouble seeing 
or reading, trouble hearing, trouble speaking and being understood, and mobility); problems 
related to the ability to work or the workplace itself; obstacles to education and availability 
of special educational facilities; problems related to local and long-distance travel; and pro- 
blems in current residence and special facilities. The information in the follow-up question- 
naire, given above, could be used to analyze the cluster characteristics, or to develop a severity 
index (see Lazarus 1985a, 1985b).


3. CLUSTERS 


This section presents a description of the procedures used in the development of the clusters. 
The clustering procedures employed were developed specifically for this application. Technical 
details concerning the methods used are given in Sections 3.2 and 3.3. All computations were 
performed using SAS. 


3.1 Methodology 


This section summarizes the methodology used to derive the final clusters. The clustering 
procedure consisted of two steps: 
a) a divisive step, where the 12,907 individuals were sequentially partitioned using PROC 
CANDISC. 
b) an agglomerative step, where the partition was collapsed. 


For the divisive step, the following procedure was employed iteratively. First, the starting 
point put all the observations into a single cluster. Each step subdivided each of the current 
clusters into two groups. For each of the current clusters, a canonical correlation analysis 
was performed by taking each non-constant variable as a grouping variable and using all 
other non-constant variables as explanatory variables. The cluster was then split into two, 
based on the discriminant analysis with the largest F-value. In this way the determinant of 
the between-sums-of-squares matrix is maximized. 
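
The divisive step described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the SAS procedure used in the study: it relies on the fact that, for a two-level grouping variable, the F-value of the two-group discriminant analysis equals the overall F-statistic obtained by regressing the 0/1 grouping variable on the other items. The function names and any profile matrix passed to them are hypothetical.

```python
import numpy as np

def best_split_variable(X):
    """Return (column index, F-value) of the 0/1 item that is best
    discriminated by the remaining non-constant items, or (None, -inf)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    nonconst = [j for j in range(X.shape[1]) if X[:, j].min() != X[:, j].max()]
    best_j, best_f = None, -np.inf
    for j in nonconst:
        others = [k for k in nonconst if k != j]
        if not others:
            continue
        y = X[:, j]
        A = np.column_stack([np.ones(n), X[:, others]])  # intercept + explanatory items
        coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        p, dof = rank - 1, n - rank  # model and residual degrees of freedom
        if p <= 0 or dof <= 0:
            continue
        if ss_res <= 1e-12 * ss_tot:
            f = np.inf  # perfect separation
        else:
            f = ((ss_tot - ss_res) / p) / (ss_res / dof)
        if f > best_f:
            best_j, best_f = j, f
    return best_j, best_f

def divisive_step(X, clusters):
    """Split every current cluster (an array of row indices) in two,
    using the item with the largest F-value within that cluster."""
    X = np.asarray(X, dtype=float)
    result = []
    for idx in clusters:
        j, _ = best_split_variable(X[idx])
        if j is None:  # nothing left to split on
            result.append(idx)
        else:
            result.append(idx[X[idx, j] == 1])
            result.append(idx[X[idx, j] == 0])
    return result
```

Starting from `[np.arange(n)]` and applying `divisive_step` repeatedly yields the sequential partition; when to stop, and how to collapse the partition afterwards, remains a matter of judgment, as in the agglomerative step.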

For the agglomerative step, subjective criteria were used, based on the magnitude of the 
F-value, the size of the groups and the plots of the points. Collapsing was accomplished in 
the reverse order of splitting, for the most part. 
For the divisive step, data based on both unweighted and weighted covariances were used
separately. The results were essentially the same. It was decided to continue without the
sampling weights because of the added complexity which would be incurred by their inclusion.
Furthermore, the weights were not expected to be important with respect to the characteristics
of the clustered individuals. Inclusion of weights is, however, necessary for evaluation and analysis.


3.2 Description 


The cluster analysis was a procedure which grouped together those screened-in respondents
with similar but not necessarily identical “profiles”. For our purposes, a respondent’s
profile consisted of the responses to the seventeen ADL’s (yes, has trouble/no, does not have
trouble), the response to the major activity limitation item (positive/negative), and the
response to the mental handicap item in the screening section of the questionnaire.

Table 2 details the final clusters. The symbols U and Z demonstrate how the groups are 
defined. The symbol U means that the group is defined through that variable being one, 
i.e. 100% by definition. The symbol Z is used when the defining screening section item is 
zero, i.e. 0% by definition. Note that six of the nineteen screening items are not used
explicitly in the process of classifying respondents. These are A11, A13, A18, A20, A23 and A24.


4. CLUSTER CHARACTERIZATION 


This section explores the ways and means of identifying the clusters. The concepts of
“trouble orientation” and “umbrella” group are introduced and the clusters are ranked
according to the severity of disability.


4.1 Trouble Orientation


Threshold values were established to assist in the cluster classification process. The values 
were chosen by ordering the clusters according to orientation and locating an obvious gap 
in the E(NADL) for the orientation, where E(NADL) referred to the average number of 
troubles among ADL’s A10 - A26. In general, a cluster was recognized as having trouble 
with an activity orientation when the E(NADL) for a particular orientation exceeded the 
established threshold value. For example, for mobility orientation, E(NADL) was computed
for activities A10, A11, A12 and A14. The E(NADL) for each cluster over each orientation
may be found in Table 3.
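
As an illustration of the calculation, the sketch below computes E(NADL) by orientation for one cluster and lists the orientations whose value exceeds a threshold. The item groupings follow Table 1; the threshold values and the toy profiles are hypothetical, since the thresholds actually used are not reproduced here.

```python
# Item groupings taken from Table 1 (A26, speaking, is left out of this sketch).
ORIENTATION_ITEMS = {
    "H": ["A24", "A25"],
    "V": ["A22", "A23"],
    "M": ["A10", "A11", "A12", "A14"],
    "A": ["A13", "A15", "A16", "A17", "A18", "A19", "A20", "A21"],
}

def e_nadl(profiles, items):
    """Average number of 'has trouble' answers per respondent over the given items.
    Each profile is a dict mapping item ID to 1 (trouble) or 0 (no trouble)."""
    return sum(sum(p.get(i, 0) for i in items) for p in profiles) / len(profiles)

def orientations_exceeding(profiles, thresholds):
    """Letters of the orientations whose E(NADL) exceeds the given threshold."""
    return [o for o, items in ORIENTATION_ITEMS.items()
            if e_nadl(profiles, items) > thresholds[o]]

# Hypothetical two-person cluster and hypothetical thresholds.
toy_cluster = [
    {"A10": 1, "A11": 1, "A24": 1},
    {"A10": 1, "A12": 1, "A14": 1, "A24": 1, "A25": 1},
]
HYPOTHETICAL_THRESHOLDS = {"H": 1.0, "V": 1.0, "M": 2.0, "A": 2.0}

print(e_nadl(toy_cluster, ORIENTATION_ITEMS["M"]))                   # 2.5
print(orientations_exceeding(toy_cluster, HYPOTHETICAL_THRESHOLDS))  # ['H', 'M']
```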

Clusters were labelled as follows. If a cluster had trouble with an activity orientation, the
corresponding letter was included in the label. Two clusters, containing individuals who had trouble
speaking and being understood or were mentally handicapped, were “special”. Clusters which
had neither mobility nor agility troubles exceeding the established values were so designated
with an N. For example, HMA1 and HMA2 refer to clusters with a large proportion having
hearing, mobility and agility problems, but no particular problem with vision. Alternatively,
VN1 refers to a cluster with the exact opposite set of problems.


4.2 Umbrella Groups 


Clusters with similar orientation patterns became members of specified “umbrella” groups,
where they could be better compared using E(NADL) within the umbrella. Table 4 shows
the clusters according to the “umbrella” groups to which they belong.
Table 2
Cluster Analysis Results

[Table body not recovered. Surviving column headings: A11, A12, A13, A14, A15, A16, A17, A18, A19. Surviving row heading: Cluster.]

Table 3

Average Number of Troubles by Orientation


Cluster   Hearing       Vision   Mobility      Agility       Total

 1        1.733         0.657    3.624         5.841         11.855
 2        [illegible]   1.508    [illegible]   2.170          8.566
 3        1.637         0.034    3.274         2.543          7.488
 4        1.579         0.016    2.582         0.710          4.887
 5        1.596         1.463    0.625         1.211          4.895
 6        1.509         0.014    1.091         [illegible]    4.766
 7        1.605         0.013    0.253         0.246         [illegible]
 8        0.012         0.493    3.772         7.203         11.480
 9        0.054         1.304    3.643         4.480          9.841
10        0.000         0.005    3.686         5.256          8.947
11        0.006         0.018    3.476         3.319          6.819
12        0.044         1.456    3.445         2.838          7.783
13        0.012         1.427    2.653         0.884          4.976
14        0.009         0.010    3.727         3.178          6.924
15        0.021         0.021    3.776         2.941          6.759
16        0.004         0.000    2.406         1.964          4.374
17        0.000         0.000    2.625         2.083          4.708
18        0.000         0.000    2.890         1.890          4.780
19        0.002         0.005    3.404         0.494          3.905
20        0.004         0.007    2.046         0.233          2.290
21        0.026         1.411    0.467         0.852          2.756
22        0.014         0.014    1.088         3.498          4.614
23        0.007         0.008    0.984         1.688          2.687
24        0.000         0.000    0.068         0.352          0.482
25        0.000         0.003    1.685         0.587         [illegible]
26        0.005         0.007    0.303         0.258          0.573
27        0.003         0.003    0.310         1.170          1.486
28        0.005         0.000    0.172         1.285          1.462
29        0.057         0.065    0.650         0.418          1.190



4.3 Severity 


One area of analytic interest is the development of an index of severity of disability. The 
notion has been considered previously by Raymond et al. (1981), among others.

The index of severity would be useful inasmuch as it would allow for simple comparisons
of disability among the screened-in respondents. The use of E(NADL) to draw such com- 
parisons presumes that the orientations are self-weighting, noting, for example, that two 
ADL’s are devoted to hearing troubles while four are devoted to mobility troubles. Also, 
the multidimensional nature of severity of disability is hidden by a single score such as 
E(NADL). 
Table 4
Ordering of Clusters by “Umbrella” Groups


Umbrella                           Sample
Group                    Cluster   Count         E(NADL)   ID

HV (Hearing/Vision)         2      187            8.566    HVMA1
                            5      203            4.895    HVN1
H (Hearing)                 1      303           11.855    HMA1
                            3      355            7.488    HMA2
                            4      [illegible]    4.829    HM1
                            6      289            4.760    HA1
                            7      1,770          2.120    HN1
V (Vision)                  9      56             9.841    VMA1
                           12      160            7.783    VMA2
                           13      164            4.976    VM1
                           21      618            2.756    VN1
S (Special)                17      24             4.708    SMA1
                           24      246            0.482    SN1
MA (Mobility/Agility)       8      245           11.480    MA1
                           10      210            8.947    MA2
                           11      166            6.819    MA4
                           14      187            6.924    MA3
                           15      677            6.759    MA5
M (Mobility)               16      458            4.374    M2
                           18      173            4.780    M1
                           19      582            3.905    M3
                           20      857            2.290    M4
A (Agility)                22      215            4.614    A1
N (Neither)                23      1,164          2.687    N1
                           25      295            2.213    N2
                           26      1,923          0.573    N6
                           27      [illegible]    1.486    N3
                           28      204            1.462    N4
                           29      494            1.190    N5



Table 4 presents an ordering of clusters according to “severity” within umbrella groups.
This within-group ordering better reflects the notion that severity is multidimensional than
would an overall ordering.
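
The idea of an incomplete (within-group) ordering can be made concrete with a short sketch. The E(NADL) values below are those printed in Table 4 for the hearing and mobility groups; the function names are hypothetical, and treating clusters from different umbrella groups as not comparable is our reading of the term “incompletely ordered”.

```python
# (umbrella group, cluster ID, E(NADL)) for a few clusters, as printed in Table 4.
ROWS = [
    ("H", "HMA1", 11.855), ("H", "HMA2", 7.488), ("H", "HN1", 2.120),
    ("M", "M2", 4.374), ("M", "M1", 4.780), ("M", "M3", 3.905), ("M", "M4", 2.290),
]

def order_within_groups(rows):
    """Return {group: [cluster IDs from most to least severe by E(NADL)]}."""
    groups = {}
    for group, cluster_id, value in rows:
        groups.setdefault(group, []).append((cluster_id, value))
    return {g: [cid for cid, _ in sorted(members, key=lambda r: -r[1])]
            for g, members in groups.items()}

def more_severe(a, b, rows):
    """True/False if a and b share an umbrella group, None if they are not comparable."""
    info = {cid: (g, v) for g, cid, v in rows}
    (ga, va), (gb, vb) = info[a], info[b]
    return None if ga != gb else va > vb

print(order_within_groups(ROWS)["M"])   # ['M1', 'M2', 'M3', 'M4']
print(more_severe("HMA1", "M1", ROWS))  # None: different umbrella groups
```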


5. CLUSTER CHARACTERISTICS 


The principal components technique was used to examine the behaviour of the resulting
clusters. Raymond et al. (1981) also employed principal components; the main difference is that
the analysis here is based upon group means rather than individuals.


5.1 Methodology 


We considered a subset of screened-in cases, where more information per case is available.
In particular, we added the responses to questions of the form: (B101) Is . . . completely
unable to walk 400 metres without resting? This line of questioning was used for each of
the ADL’s, A10-A26. Of the original 12,907 individuals who were screened in, 11,412
were usable; the other 1,495 were dropped because of non-response problems. These
“completely unable” items were coded “1” when the individual indicated that he/she was
completely unable to perform the specified ADL; otherwise, a “0” was coded.

The means were obtained for the nineteen screening items and seventeen follow-up items 
for each cluster. The means for the completely unable items were then multiplied by the ratio 
of the overall average number of ADL’s to the overall average of completely unable items 
in order to scale them consistently and to avoid the scaling problems associated with prin- 
cipal components analysis. 

Principal components were obtained using the nineteen screening section and seventeen
follow-up item means as variables, using the “clusters” as observations and weighting
according to cluster size. The clusters were then ordered according to each of the first four
principal component scores.
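
A minimal sketch of this kind of computation is given below. It is an illustration only: the function names are hypothetical, the principal components are taken from the size-weighted covariance matrix of the cluster means (an assumption, suggested by the rescaling step), and the rescaling ratio is our reading of the description above.

```python
import numpy as np

def rescale_followup(adl_means, unable_means, sizes):
    """Multiply the 'completely unable' means by the ratio of the overall average
    number of ADL troubles to the overall average number of 'completely unable' items."""
    adl = np.asarray(adl_means, dtype=float)
    unable = np.asarray(unable_means, dtype=float)
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    ratio = (w @ adl.sum(axis=1)) / (w @ unable.sum(axis=1))
    return unable * ratio

def weighted_pca(cluster_means, sizes):
    """Principal components of cluster means, with clusters weighted by size.
    Returns eigenvalues, loadings (columns), scores and explained proportions."""
    M = np.asarray(cluster_means, dtype=float)
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    Z = M - w @ M                       # centre at the weighted mean
    cov = (Z * w[:, None]).T @ Z        # weighted covariance matrix
    evals, evecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    scores = Z @ evecs
    return evals, evecs, scores, evals / evals.sum()
```

Ranking the clusters on a component then amounts to sorting them by the corresponding column of `scores`; note that the sign of each component is arbitrary.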

The final stage involved the pooling of cluster cases according to “umbrella” group
membership and finding the means of the first four principal component loadings for each
of the eight “umbrella” groups, where the weights were the numbers of members in the
“umbrella” groups.


5.2 Results 


We present the results in two stages. In the first stage, we examine the principal
components and attempt to label them according to the scores. We also explore the “umbrella”
group construct in terms of the principal component means. In the second stage, we examine
the ordering of the clusters according to the first four principal components.


5.2.1 Components 


The first four principal components for the nineteen screening section items and the seven- 
teen follow-up items explained just over seven-eighths of the total variance and appeared 
to be most useful for our purposes. 

The loadings of the first principal component are positive on all but four items (A24,
A25 and B241 are hearing oriented, A28 is mental handicap). The negative loadings are close
to zero. This first component appears to be an overall measure of strength. The first
principal component explained nearly 66% of the total variance and is denoted as “OVERALL”.

There are negative loadings on A10, A11, A12, A14 and A15 of the second component.
The loading for A15 is nearly zero, however. Loadings are positive for ADL’s with an
agility-trouble orientation as well as for hearing-trouble and vision-trouble orientations. It appears
then that this component polarizes mobility trouble against agility, hearing and vision troubles.
The second component is labelled “AHV/M”.

The third principal component has positive loadings for mobility and hearing oriented 
ADL’s and negative loadings for agility and vision oriented ADL’s. This third component 
is denoted “MH/AV”.

The fourth principal component has positive loadings for mobility and vision oriented 
ADL’s and negative loadings for agility oriented ADL’s. This fourth component is designated 
“MV/A”.


5.2.2 Mean Loadings 


Table 5 presents the average differences of the principal component scores from the mean
scores over all 11,412 individuals, for each of the eight “umbrella” groups.


Table 5

Average Differences of Principal Component
Scores from Mean Scores


                                              Differences

                     Sample   PRIN1         PRIN2     PRIN3     PRIN4
Umbrella Group       Count    (Overall)     (AHV/M)   (MH/AV)   (MV/A)

Hearing/Vision         346     0.68          1.26      0.61      1.06
Hearing               2741    -0.33          0.54      0.81     -0.25
Vision                 888     0.30          0.69     -0.76     [illegible]
Special                151    -1.02         -0.04     -0.47     -0.06
Mobility/Agility      1311    [illegible]   -0.33     -0.21     -0.33
Mobility              1893     0.30         -0.80      0.18      0.33
Agility                195    -0.19          0.31     -0.80     -0.78
Neither               3887    -1.11         -0.16     -0.41     -0.22


We can now check to see if the incomplete ordering presented earlier is consistent with the results
from the principal components analysis. The following observations are taken from
Table 5.
i) The mobility/agility “umbrella” group has the highest difference on the first principal
component “overall”, while the “umbrella” group “neither” has the lowest difference.
The difference for the hearing/vision group is positive, as is the difference for the vision
group. The hearing group difference is negative, however, evidence that individuals
with hearing-oriented troubles tend not to have other disabilities. There may be an
inclination to draw the same kind of conclusion with respect to agility-oriented
troubles. It is observed that the mobility/agility and mobility groups have positive
differences while the agility “umbrella” group has a negative difference. However, in this
case, the result is somewhat ambiguous because the agility-oriented ADL’s included
speaking trouble (A26), a so-called “special” trouble area, and it is clear that
the special “umbrella” group has a negative difference for the first principal component.
ii) The second component set mobility-oriented troubles (-) against agility, hearing and
vision-oriented troubles (+). Positive differences are recorded for the hearing/vision,
hearing, vision and agility “umbrella” groups while negative differences are associated
with the mobility/agility, mobility and neither groups, as expected. The difference for
the special group is nearly zero.
iii) The third component set mobility-oriented and hearing-oriented troubles (+) against
agility-oriented and vision-oriented troubles (-). Again, the results are consistent.
iv) The fourth principal component set mobility and vision-oriented troubles (+) against
agility-oriented troubles (-). The results are again consistent with the umbrella-group
construct.


5.2.3 The Scales 


Table 6 shows the ranks of the clusters according to the first four principal component 
scores and E(NADL). Recall that the component loadings are for 11,412 cases and utilize 
follow-up information as well as screening section information while the E(NADL) scale is 
based on 12,907 cases and uses screening information only. 

The cluster ranking according to principal components was done as follows. The compo- 
nent representing overall strength (OVERALL) ranked clusters from highest to lowest scores. 
The ranking of clusters on AHV/M tended to put clusters with mobility-oriented troubles 
at the bottom end as opposed to clusters with agility, hearing or vision oriented troubles 
which were ranked higher up on this scale. The ranking of clusters on MH/AV tended to
put clusters with mobility or hearing troubles at or near the bottom of the scale while clusters 
with agility or vision-oriented troubles were ranked higher. Finally clusters with agility-oriented 
troubles were ranked higher on MV/A than the others. Given the bipolar nature of com- 
ponents 2, 3 and 4, it was necessary to make an arbitrary decision as to a trouble orientation 
scale. As cluster 8 had shown itself to be highly severe according to the E(NADL) scale, 
it was determined that cluster 8 should be similarly ranked along the other scales. 

For most clusters, the rankings fluctuate over a wide range. This reflects the nature of
the criteria upon which the scales were based. The first principal component, which provides
an overall measure of strength, may be the most suitable candidate for ranking the clusters.
Firstly, it incorporates the screening section information used in the development of the
E(NADL) measure. As a result, the rank orderings provided by the OVERALL and E(NADL)
scales are quite similar. Secondly, the additional follow-up information used in the construction of
this component leads us to believe that OVERALL is better than other scales such as
E(NADL). It is worth noting that the ranking was done on all 29 clusters and depicted in
Table 6 on an “umbrella” group basis. The “umbrella” group information was not
incorporated into the principal components analysis, however.


Table 6

Cluster Rank According to Alternative Scales


                  PRIN1         PRIN2         PRIN3         PRIN4
Cluster   ID      (Overall)     (AHV/M)       (MH/AV)       (MV/A)        E(NADL)

 2        HVMA1    9             4            27            28             5
 5        HVN1    22            [illegible]   [illegible]   25            12
 1        HMA1     3             3            24             6             1
 3        HMA2    10            14            28            10             7
 4        HM1     16            15            29            20            13
 6        HA1     20             8            25             3            15
 7        HN1     [illegible]   [illegible]   26             9            24
 9        VMA1     2             6             4            [illegible]    3
12        VMA2    [illegible]   10            [illegible]   27             6
13        VM1     [illegible]   11            [illegible]   [illegible]   11
21        VN1     23             5             2            26            20
 8        MA1      1             1             1             1             2
10        MA2      5            20            13             4             4
14        MA3      6            24            16             7             8
11        MA4     [illegible]   23            17             8             9
15        MA5      8            28            20            18            10
18        M1      14            26            19            21            14
16        M2      15            [illegible]   18            [illegible]   18
19        M3      11            29            23            24            19
20        M4      18            [illegible]   21            [illegible]   22
22        A1      17             9             3             2            17
23        N1      21            17             6             5            21
25        N2      19            [illegible]   10            16            23
27        N3      24            19            15            12            25
28        N4      28            12             9            11            26
29        N5      25            16            12            15            27
26        N6      26            18             8            14            28
17        SMA1    12            [illegible]   14            19            16


6. CLOSING REMARKS


A clustering technique was employed to group screened-in individuals according to similar 
screening section profiles. The clusters were then ordered according to the information con- 
tained in the screening section of the questionnaire (the incomplete ordering based on 
E(NADL) and presented in Table 4) and finally according to information contained in the 
screening and follow-up sections of the questionnaire (the OVERALL scale presented in Table 
6). This last scale is deemed presently to be the most suitable of those considered here. 
However, it could be argued that no single index of severity exists and in fact the severity 
index should be defined as a 4-dimensional scale corresponding to our principal components. 
Additive Versus Multiplicative Seasonal Adjustment When
There Are Fast Changes in the Trend-Cycle¹


GUY HUOT and NAZIRA GAIT²


ABSTRACT 


The seasonal adjustment of a time series is not a straightforward procedure, particularly when the level
of the series nearly doubles in just one year. The 1981-82 recession had a sudden and large impact not
only on the structure of the series but also on the estimation of the trend-cycle and seasonal components
at the end of the series. Serious seasonal adjustment problems can occur. For instance, the selection
of the wrong decomposition model may produce underadjustment in the seasonally high months and
overadjustment in the seasonally low months. The wrong decomposition model may also signal a false
turning point. This article analyses these two aspects of the interplay between a severe recession and
seasonal adjustment.


KEY WORDS: Decomposition models; ARIMA; Lead-lag relationship. 


1. INTRODUCTION 


1981 and 1982 were atypical years afflicted by a severe recession. This recession has pro- 
foundly affected the evolution and structure of economic time series, and consequently their 
seasonal adjustment. Seasonally adjusted time series are necessary to diagnose the socio- 
economic health of a country. In turn, social and economic policies founded on these data 
influence decisions in both the private and public sectors. Thus, this recession raises many 
questions. One can readily see that a prompt examination of seasonal adjustment is necessary. 

The series under consideration here are: initial and renewal claims received (for unemploy- 
ment benefits) and beneficiaries. It is difficult to see how their trend and cycle components 
evolve when they are contaminated by seasonal variation, namely intra-annual climatic and 
institutional factors. Seasonal adjustment permits a better detection of fundamental tenden- 
cies, such as turning points, and evaluation of the present performance of the economy. 

This article analyses some aspects of the interplay between a severe recession and seasonal 
adjustment. In just one year, that is in 1981, this recession has nearly doubled the level of 
beneficiaries. Such a sudden large change prompts questions about the structure of the series, 
the choice of the X-11-ARIMA decomposition model, the determination of turning points 
at the end of the series, and the use of ARIMA forecasts for seasonal adjustment. 

In section 2, we discuss two important consequences of using a wrong decomposition model, 
namely a systematic over- and under-adjustment of series and the possibility of having a false 
turning point at the end of the series. In section 3, we use the lead-lag relationship between 
the claims and beneficiaries series to help seasonally adjust the latter series. 

The ARIMA forecasts generally help to reduce the revision to the seasonal factors and 
they can help to provide a more accurate recognition of the turning points at the end of the 
series. Section 4 considers this question. 


¹ This paper was presented at the 145th Annual Meeting of the American Statistical Association, Las Vegas,
Nevada, 1985.


² Guy Huot, Time Series Research and Analysis Division, Statistics Canada. N. Gait, University of Sao Paulo, Brazil,
was visiting Statistics Canada when the paper was written.
2. DECOMPOSITION MODELS FOR SEASONAL ADJUSTMENT


Most of the claims and beneficiaries series have similar characteristics, so we have chosen 
to study one claims series and one beneficiaries series which can clearly illustrate some of 
the problems peculiar to seasonal adjustment during a severe recession. It should be noted 
that the results of our analysis are equally valid during a sudden strong expansion in the 
economy. It is the sudden large change in the level of the series caused by the recession or 
the expansion that is important. 

The X-11-ARIMA program (Dagum 1980) will be used to seasonally adjust these series. 
The program is applied to the claims and beneficiaries series, using data from January 1973 
and May 1975 respectively, up to February 1983. 

The X-11-ARIMA program provides three decomposition models for the estimation
of the time series components. The program assumes an additive relationship between the
components

$O_t = TC_t + S_t + I_t$ (2.1)

or a multiplicative one

$O_t = TC_t \, S_t \, I_t$ (2.2)

or a log additive one

$\log O_t = \log TC_t + \log S_t + \log I_t$ (2.3)


where $O$ stands for the observed and unadjusted series; $TC$, the trend-cycle; $S$ and $I$, the
seasonal and irregular components; and $t$, the time.

Seasonal adjustment means removing the seasonal variations $S_t$ from the raw data $O_t$,
thus leaving a seasonally adjusted series consisting of $TC_t$ and $I_t$. In order to know whether
a certain series contains a significant amount of seasonality and if so, whether an additive 
or multiplicative model provides the better fit, one can perform a test for the presence of 
seasonality and a model test on the series (Higginson 1977). The first test shows that both 
series contain a very significant amount of seasonality. According to the second test, the 
multiplicative model fits the beneficiaries series better when tested from May 1975 to June 
1981. When the series is extended to February 1983, taking into account the impact of the 
recession on the series, the additive model then fits better. On the other hand, the model 
test favours neither the additive nor the multiplicative model for the claims series. 

One usually adjusts a series using only one model; Figure 1, however, shows the
beneficiaries series adjusted using the two models, both without using the ARIMA option.
During 1980 and 1981, the difference between the additive and multiplicative adjustments
was small compared with the difference observed in 1982.

The multiplicative model assumes that the seasonal variation is proportional to the level 
of the trend-cycle. During 1982, the seasonal amplitude did not increase in this way. Conse- 
quently, using the multiplicative model is likely to overestimate it from June to November, 
the seasonally low months. As figure 1 shows, in underestimating the number of seasonal 
beneficiaries, the multiplicative model has drastically overestimated the number of seasonal- 
ly adjusted beneficiaries. The converse is also true. 
Figure 1. Beneficiaries


The additive model, on the other hand, does not assume that the components of the series 
evolve proportionately. Figure 1 confirms that the trend-cycle increased while the seasonal
amplitude remained constant. Thus, the additive model provides the better seasonal adjustment. 
It performs better in 1982 than the multiplicative model and is acceptable in 1980 and 1981. 
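
The mechanism can be illustrated with a small numerical sketch. All figures below are hypothetical and are not taken from the beneficiaries series: the seasonal amplitude is held constant while the level doubles from 100 to 200, and the multiplicative seasonal factors are those implied by the old level. There is no irregular component, and the X-11-ARIMA filters are not reproduced.

```python
# Hypothetical additive seasonal pattern, January to December (sums to zero);
# June to November are the seasonally low months.
SEASONAL = [20, 20, 15, 10, 0, -10, -15, -20, -15, -10, -5, 10]
BASE_LEVEL = 100.0   # trend-cycle level before the sudden change (hypothetical)
NEW_LEVEL = 200.0    # trend-cycle level after it has doubled (hypothetical)

def multiplicative_factors(seasonal, level):
    """Seasonal factors implied by the additive pattern at a given level."""
    return [(level + s) / level for s in seasonal]

def adjust_additive(observed, seasonal):
    return [o - s for o, s in zip(observed, seasonal)]

def adjust_multiplicative(observed, factors):
    return [o / f for o, f in zip(observed, factors)]

# Observed series after the level doubles, with unchanged seasonal amplitude.
observed = [NEW_LEVEL + s for s in SEASONAL]
factors = multiplicative_factors(SEASONAL, BASE_LEVEL)

additive = adjust_additive(observed, SEASONAL)
multiplicative = adjust_multiplicative(observed, factors)

for month, (a, m) in enumerate(zip(additive, multiplicative), start=1):
    print(f"month {month:2d}: additive {a:6.1f}  multiplicative {m:6.1f}")
```

In this sketch the additive adjustment returns the level of 200 in every month, whereas the multiplicative adjustment lies above 200 in the seasonally low months and below 200 in the seasonally high months, which is the pattern of over- and under-adjustment described above.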

By mid-1982, it was not easy to tell which of the additive or multiplicative models would 
adjust the beneficiaries series better. Since this series was adjusted multiplicatively until June 
1981, one would normally continue to do so in 1982. During 1982, were there some clues 
or pieces of evidence showing that the multiplicative model was no longer adequate? 

The acceptance or rejection of a model, given a sudden large change in the level of a series,
clearly has to be based on a thorough analysis of the data. The set of quality control statistics 
included in the X-11-ARIMA program is not meant to detect that kind of problem in the model. 
In this experiment with the multiplicative model, none of the ten individual control statistics 
failed the guideline. However, the F test for the presence of moving seasonality showed the 
presence of increasing moving seasonality during 1982 in the final unmodified SI ratios. 

Besides a systematic over- and under-adjustment of the series, another consequence of using
a wrong decomposition model is the possibility of having a false turning point at the end 
of the series. 
Let us say that a cyclical turning point has occurred if the seasonally adjusted series shows
a change in direction that persists for at least 5 months. Once the beneficiaries series has
been seasonally adjusted multiplicatively, figure 1 shows the possible presence of a turning
point around October 1982, where the upward trend has suddenly changed to a downward
trend. This turning point seems to be confirmed when the series ending in December 1982
is extended by one month. The additively adjusted series, on the other hand, shows no
turning point. The two results conflict. Thus, either the multiplicative model is signalling a false
turn or the additive model is missing the turning point.
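
The working definition of a turning point can be written down as a simple rule. The sketch below is one possible reading of it, assuming that “persists” means a strictly monotone movement in the new direction for the stated number of months; the function name and the example series are hypothetical.

```python
def turning_points(series, persist=5):
    """Return (index, kind) pairs where the series changes direction at `index`
    and then moves in the new direction for at least `persist` consecutive steps."""
    points = []
    for i in range(1, len(series) - persist):
        before = series[i] - series[i - 1]
        after = [series[k + 1] - series[k] for k in range(i, i + persist)]
        if before > 0 and all(d < 0 for d in after):
            points.append((i, "peak"))
        elif before < 0 and all(d > 0 for d in after):
            points.append((i, "trough"))
    return points

# Hypothetical series: rises for three months, then falls for five.
print(turning_points([1, 2, 3, 4, 3, 2, 1, 0, -1]))  # [(3, 'peak')]
```

Under this rule a turn cannot be confirmed until five further observations are available, which is why a turning point near the end of a series remains provisional.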

It is not that easy to show that the multiplicative model has signalled a false turn. The 
multiplicative model has created a turning point around October 1982. Table 1 shows that 
in the very short run, the updating of the series did not reverse this turning point. 


Table 1

Multiplicatively Adjusted Beneficiaries Series
(in thousands, July 1982 - February 1983)

July   Aug.   Sept.   Oct.   Nov.   Dec.   Jan.   Feb.

 124    131    140
 124    130    140    142
 124    130    140    141    141
 124    130    140    142    138    131
 123    130    140    142    141    131    121
 123    129    139    142    141    134    121    [illegible]


3. LEAD-LAG RELATIONSHIP BETWEEN THE CLAIMS AND 
BENEFICIARIES SERIES 


Leading indicators are sensitive to the evolution of the economic climate. They are measures 
of anticipations or new commitments, and as such they give an advance indication of changes 
expected in the trend-cycle of coincident and lagging indicators. 

Figure 2 shows the claims series as a leading indicator for the beneficiaries series. The 
performance of the seasonally adjusted indicators can be tested using the criteria of Klein 
and Moore (1982). The two series satisfy these criteria. First, the correspondence between 
the series is one-to-one - the number of cycles is the same in each series. Second, there is 
uniformity in timing - the claims series always leads. Third, these are monthly series and they 
are current, or up-to-date. Thus, the claims series is likely to predict an upward or a downward 
change in the trend of the beneficiaries series. 

The lead-lag relationship between the two series can help to seasonally adjust the
beneficiaries series. It reduces the likelihood of mistaking an irregular turn for a cyclical
turning point. Figure 2 shows September 1982 to be a turning point in the multiplicatively
adjusted claims series. This is also true for the additive adjustment of the series. Since the
cross-correlations between the two series show a lead-lag relationship of 5 to 6 months, the
September 1982 turning point in the claims series indicates that the multiplicative model
applied to the beneficiaries series has signalled a false turn around October 1982. Instead,
the leading indicator predicts a turning point around March 1983 in the beneficiaries series.

Figure 2. Claims and Beneficiaries. The Number of Beneficiaries has been Divided by 3 in Order
to Make the Scale of Both Series Compatible.
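
A lead of this kind can be read off the cross-correlations between a leading and a lagging series. The sketch below is a minimal illustration with synthetic data: the two series are invented, the function names are hypothetical, and a real analysis would work with the seasonally adjusted series and would usually remove trend before correlating.

```python
import math
from statistics import mean, pstdev

def lagged_correlation(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag], for lag >= 0."""
    x = leader[:len(leader) - lag] if lag else leader[:]
    y = follower[lag:]
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return cov / (pstdev(x) * pstdev(y))

def best_lead(leader, follower, max_lag):
    """Lag, in months, at which the leader is most correlated with the follower."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(leader, follower, k))

# Synthetic cycle: the follower repeats the leader five months later.
base = [math.sin(2 * math.pi * t / 24) for t in range(-5, 48)]
leader = base[5:]       # 48 monthly values
follower = base[:48]    # same cycle, delayed by 5 months

print(best_lead(leader, follower, max_lag=12))
```

With a lead estimated this way, a turning point in the leading series gives an approximate date at which a turn in the lagging series should be expected.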


4. ARIMA EXTRAPOLATIONS 


An optimal seasonal adjustment procedure has to minimize the revision to the current
seasonal factors and also has to produce reliable estimates of the trend-cycle, particularly
of turning points, at the end of the series (Dagum 1979). The analysis carried out in the previous
sections is based on seasonally adjusted data without using the ARIMA option. In this
section, we shall focus on the use of the ARIMA forecasts as a variable that can provide an
accurate recognition of the turning points.

The automatic X-11-ARIMA program proceeds as follows: 


1. Three univariate ARIMA models of the general multiplicative form $(p,d,q)(P,D,Q)_s$
(Box and Jenkins 1970) are fitted to the monthly or quarterly series that is to be seasonally
adjusted. The models are

$(0,1,1)(0,1,1)_s$

$(0,2,2)(0,1,1)_s$

$(2,1,2)(0,1,1)_s$

when the series is seasonally adjusted additively. For series adjusted multiplicatively, the
same models are used and the log transform is applied to the data for the first two models.

2. The series is extrapolated one year in advance; and

3. provided the extrapolations are acceptable, the ordinary X-11 method is then applied
to the series thus extended.
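
For reference, the three models $(0,1,1)(0,1,1)_s$, $(0,2,2)(0,1,1)_s$ and $(2,1,2)(0,1,1)_s$ can be written out in the usual Box-Jenkins backshift notation. In the block below, $B$ is the backshift operator ($B z_t = z_{t-1}$), $s$ is the seasonal period (12 for monthly and 4 for quarterly series), and $a_t$ is a white-noise innovation. The parameter symbols $\theta$, $\Theta$ and $\phi$ are notation introduced here for illustration; for a multiplicative adjustment, $z_t$ would be the logarithm of the series in the first two models.

```latex
\begin{align*}
(0,1,1)(0,1,1)_s:&\quad (1-B)(1-B^s)\,z_t = (1-\theta B)(1-\Theta B^s)\,a_t \\
(0,2,2)(0,1,1)_s:&\quad (1-B)^2(1-B^s)\,z_t = (1-\theta_1 B-\theta_2 B^2)(1-\Theta B^s)\,a_t \\
(2,1,2)(0,1,1)_s:&\quad (1-\phi_1 B-\phi_2 B^2)(1-B)(1-B^s)\,z_t = (1-\theta_1 B-\theta_2 B^2)(1-\Theta B^s)\,a_t
\end{align*}
```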


Figure 3 shows the beneficiaries series seasonally adjusted both additively and
multiplicatively, using the automatic X-11-ARIMA options. The ARIMA models that best
fit and forecast the series ending in December 1982 are $(0,2,2)(0,1,1)_{12}$ when the series is
seasonally adjusted additively and log $(0,2,2)(0,1,1)_{12}$ when adjusted multiplicatively. The
log $(0,2,2)(0,1,1)_{12}$ model has forecast a decrease in the series, while the $(0,2,2)(0,1,1)_{12}$
model has maintained the upward trend.

Figure 3 also shows the multiplicative seasonal adjustment of the beneficiaries series using
both the upward trend and the downward trend extrapolations. One can see from the
comparison of figure 1 with figure 3 that ARIMA extrapolations did not modify the multiplicative
estimates of the trend-cycle in the last year. The multiplicative model is still signalling a turning
point around October (downward trend, log transform). The multiplicative model applied
to either the non-extended beneficiaries series (figure 1) or to the extended series is questionable.

By the end of 1983, one could see that the true turning point had actually occurred around
February 1983. Thus, a false October or November 1982 turning point can hardly be corrected
by extrapolation when it is due to the wrong selection of the decomposition model.


[Figure 3: vertical axis in thousands (700 to 1,800); horizontal axis Time (1982-1983).
Plotted series: Original Series; Extrapolations Using Log (0,2,2)(0,1,1);
Extrapolations Using (0,2,2)(0,1,1); Multiplicative Adjustment Using ARIMA Extrapolations
with Log (0,2,2)(0,1,1); Additive Adjustment Using ARIMA Extrapolations with (0,2,2)(0,1,1);
Multiplicative Adjustment Using ARIMA Extrapolations with (0,2,2)(0,1,1).]

Figure 3. Beneficiaries Series Seasonally Adjusted Additively and Multiplicatively with Different 
ARIMA Extrapolations

Over- and under-adjustment and problems of identifying the turning points occurred in
other series as well. Figure 4 shows, for instance, the series of "benefits paid" when seasonally
adjusted multiplicatively with actual data available to the end of 1984. The seasonally
adjusted series tends to oscillate systematically around the trend-cycle curve at the turning
point, thus over- and underestimating the benefits paid. After the turning point, the
oscillation decays to the trend-cycle curve, showing that the multiplicative model is doing poorly
around the turning point. Note that this series has strong trading-day variation, which has
also been removed.


5. SELECTION OF THE OPTIMAL SEASONAL 
ADJUSTMENT PROCEDURE 


Figure 5 summarizes the criteria for seasonal adjustment that have been taken into
account to overcome the problems due to the interplay between the 1981-82 recession and
seasonal adjustment of the beneficiaries and claims series. The selection of the best seasonal
adjustment procedure was primarily based on the first criterion.

In order to avoid over- and underestimation and false turning points in the seasonally 
adjusted figures, the appropriate decomposition model has to be selected. A thorough analysis 
of the data should be conducted by: 


1. performing a model test on the series. 

2. adjusting the series both additively and multiplicatively if the effort is justified. If the
difference between the two adjustments becomes significant as in figure 1, one has to
check for underadjustment in the seasonally high months and for overadjustment in
the seasonally low months. One can also look in table D8 of the X-11-ARIMA program
at the F tests for the presence of stable and moving seasonality. The decomposition
model that better adjusts the series will usually show the higher F value for stable
seasonality and the lower F value for moving seasonality.

3. checking for turning points. For the claims series, both decomposition models have
signalled a turn in August or September 1982. On the other hand, for the beneficiaries series,
only the multiplicative model has signalled a turn in October 1982. Thus either the
multiplicative model is signalling a false turn or the additive model is missing the turning
point. The analysis has shown this turn to be a false one resulting from the drastic
overestimation of the number of seasonally adjusted beneficiaries in the seasonally low months
as shown in Figure 1.

4. using a bi- or multivariate approach to accurately estimate the turning points at the end 
of the series. The lead-lag relationship between the claims and beneficiaries series can 
help to seasonally adjust the beneficiaries series. It reduces the likelihood of mistaking 
an irregular turn for a cyclical turning point. Since the lead is about 5 to 6 months, 
the September 1982 turning point in the claims series confirms that the multiplicative 
model applied to the beneficiaries series has signalled a false turn in October 1982. 
However, the leading indicator is predicting a turning point around March 1983 in the 
beneficiaries series. 

5. using the ARIMA option with concurrent seasonal factors. It usually gives smaller
revisions to the seasonal factors whether an additive or a multiplicative seasonal
adjustment is made. However, a false turning point can hardly be corrected by extrapolations
when it is due to the wrong selection of the decomposition model.

Figure 4. Benefits Paid (Seasonally Adjusted Multiplicatively)

Figure 5. Optimal Seasonal Adjustment Procedure

6. checking both the raw and seasonally adjusted data. One cannot rely on tests only. For
instance, the set of quality control statistics included in the X-11-ARIMA program is
not meant to detect under- or overestimation of the series or false turning points.

7. all the above recommendations apply if the series is not strongly affected by
trading-day variation. If trading-day variation is present, then it must be removed before the
ARIMA option is used.


Nonresponse Adjustment Procedures at the
U.S. Bureau of the Census


DAVID W. CHAPMAN, LEROY BAILEY, and DANIEL KASPRZYK¹


ABSTRACT 


Nearly all surveys and censuses are subject to two types of nonresponse: unit (total) and item (partial). 
Several methods of compensating for nonresponse have been developed in an attempt to reduce the 
bias associated with nonresponse. This paper summarizes the nonresponse adjustment procedures used 
at the U.S. Census Bureau, focusing on unit nonresponse. Some discussion of current and future research 
in this area is also included. 


KEYWORDS: Nonresponse adjustments; Imputation; Missing data; Weighting. 


1. INTRODUCTION 


The Bureau of the Census has long recognized the potential seriousness of measurement
errors ascribed to survey nonresponse, and has consistently incorporated nonresponse
adjustment or compensation procedures in the estimation methodologies for its numerous and
varied surveys and censuses. The objective of this paper is to provide an overview of
procedures employed by the Census Bureau in compensating for nonresponse, primarily unit
nonresponse. By unit nonresponse we mean that little or no information for the principal
survey variables is obtained for the sample unit in question.

This presentation will include (1) a discussion of the general weighting scheme used for the
demographic surveys; (2) a review of some of the distinct problems associated with
nonresponse in the Survey of Income and Program Participation (SIPP); (3) a discussion of
the handling of unit nonresponse for the economic surveys and censuses; and (4) a section
on imputation for earnings for the Current Population Survey. In addition to providing
descriptions of the various nonresponse compensation methods used by the Census Bureau,
the authors will cite specific problems associated with those methods and note the Bureau’s
current nonresponse research activities and concerns.


2. NONRESPONSE IN DEMOGRAPHIC SAMPLE SURVEYS 


At any given time, the Bureau of the Census may be conducting 25 to 30
recurring or special demographic surveys. The concerns of these surveys include labor force
participation, individual and family income, health care, transportation, leisure activities,
crime, and other topics reflective of the current interests of the nation’s people, governments,
businesses, and institutions. Unit nonresponse rates for these surveys range from three to
four percent for the National Crime Survey to over 25 percent, which was recorded for
the 1984 National Survey of Natural and Social Scientists and Engineers.


¹ David W. Chapman and Leroy Bailey are Principal Researchers, Statistical Research Division, U.S. Bureau of
the Census, Washington D.C. 20233. Daniel Kasprzyk is a Special Assistant, Office of the Chief, Population
Division, U.S. Bureau of the Census, Washington D.C. 20233.

Weight adjustment within classes (Oh and Scheuren 1983), or cell balancing, is the predominant
technique used to compensate for unit nonresponse in the Census Bureau’s demographic
surveys. There is variation among the surveys relative to the determination of adjustment 
classes within which weighting occurs. For some surveys, ancillary data available to define 
weighting classes are limited to basic geographic and survey design information, while for 
others a considerable amount of demographic and economic data are accessible. 

The nonresponse adjustment factors for the Bureau’s demographic surveys are usually
the inverse of the survey’s weighted or unweighted response rate. In a small number of the
surveys this adjustment factor is modified slightly to reflect information gleaned from
follow-up subsamples of the initial nonrespondents. Since the Census Bureau’s general approach
to survey nonresponse is essentially the same for all of its major demographic surveys, a
general description will be given in Section 2.1 of the nonresponse adjustment procedure for
the National Crime Survey (NCS), as an example of a "typical" Census Bureau application
of weighting. Section 2.2 will consist of a discussion of alternative procedures and current
unit nonresponse research in the demographic areas.


2.1 The National Crime Survey 


The NCS sample is a national probability sample of about 72,000 households which is 
divided into six panels, each of which is interviewed in a given month and again at six-month 
intervals over three years. The survey focuses on measuring household crimes and the extent 
of victimization of household members age 12 and older by assault (including rape), burglary, 
larceny, auto theft, and robbery. [For a detailed description of the NCS, see U.S. Department
of Commerce, Bureau of the Census (1977).]

Estimates for the NCS, which are produced quarterly, are derived by initially inflating 
the sample data by the inverse of the related selection probabilities. The noncontacts and 
refusals account for about three to four percent of the survey’s occupied units in any given 
month. Adjustments for these units are made by applying adjustment factors to the weighted 
respondent data in weighting classes. An attempt is made to define these classes in such a 
way that the respondents and nonrespondents in each class have similar survey characteristics. 
In order to temper the impact of the nonresponse adjustment on the variance of the survey 
estimates, some of the smaller weighting classes generally have to be collapsed with other 
classes before a final nonresponse adjustment can be effected. Collapsing of classes also takes 
place if the weight adjustment factor becomes too large for one or more classes. [See Hanson 
(1978).] Collapsing is discussed further in Section 4. 
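
The basic weighting-class adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Census Bureau's production procedure: the adjustment factor is taken to be the inverse of the weighted response rate in each class, no collapsing is attempted, and the function name, record layout, and figures are hypothetical.

```python
from collections import defaultdict

def adjust_weights(units):
    """units: list of dicts with keys 'cls', 'weight', 'responded'.
    Returns (factors, adjusted): the adjustment factor for each class and the
    adjusted weight for each unit (0.0 for nonrespondents).
    Assumes every class contains at least one respondent."""
    total = defaultdict(float)   # weighted count of all sample units, by class
    resp = defaultdict(float)    # weighted count of respondents, by class
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            resp[u["cls"]] += u["weight"]
    # Inverse of the weighted response rate in each class.
    factors = {c: total[c] / resp[c] for c in total}
    adjusted = [u["weight"] * factors[u["cls"]] if u["responded"] else 0.0
                for u in units]
    return factors, adjusted

# Invented sample: class A has 3 of 4 units responding, class B has 4 of 5.
units = (
    [{"cls": "A", "weight": 100.0, "responded": r} for r in (True, True, True, False)]
    + [{"cls": "B", "weight": 50.0, "responded": r} for r in (True, True, True, True, False)]
)
factors, adjusted = adjust_weights(units)
print(factors)
```

The adjusted respondent weights in each class add up to the weighted total of all sample units in that class, so the nonrespondents' share is carried by respondents from the same class.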

Since the NCS employs a self-response method of interviewing, there is concern about 
the amount of within household nonresponse. Consequently, a separate set of weighting 
cells exists to compensate for within-household nonresponse. These cells or weighting 
classes, as well as those used for the household nonresponse adjustment, are indicated in 
Tables 1-3. The NCS household and within household nonresponse rates for 1984 are shown 
in Table 4. 

To illustrate the NCS estimator of a total, suppose there is a selection probability $\pi_i$,
$i = 1, 2, \ldots, N$, associated with each of the N units in the population. It is assumed that
among the n sample units, $n_R$ are respondents. The NCS estimator for the population total,
after adjusting for unit nonresponse, takes the following form:

$$\hat{Y} = \sum_{j=1}^{M} \sum_{k=1}^{K} \sum_{\ell=1}^{n_{Rjk}} \frac{y_{jk\ell}}{\pi_{jk\ell}\, z_j\, u_k}$$

where for sample units in the k-th within household and j-th household weighting classes,

$y_{jk\ell}$ = value of the $\ell$-th sample respondent,
$n_{Rjk}$ = number of sample respondents,
$n_{jk}$ = number of sample cases,
$z_j$ = the estimated household response rate,
$u_k$ = the estimated within household response rate,
$\pi_{jk\ell}$ = selection probability for the $\ell$-th sample respondent,
$K$ = total number of within household nonresponse weighting classes,
$M$ = total number of household nonresponse weighting classes.


Table 1

NCS Noninterview Adjustment Cells for Within Household Nonresponse

Cells are formed by household relationship (Head of Household; Wife of Head; All other
Persons) crossed with persons by age, by race of head.


Table 2

NCS Household Noninterview Adjustment Cells for Standard Metropolitan Statistical Areas (SMSA’s)

Cells are formed by race (White; Not white) crossed with location within the SMSA
(Central City of SMSA; Balance of SMSA).


Table 3

NCS Household Noninterview Adjustment Cells for Non-SMSA’s


Implicit in the formation of the NCS nonresponse weighting classes, as well as those for 
other demographic surveys, are the following assumptions: 


1. There is "significant" correlation between the major survey variables and the covariates
used to define noninterview adjustment classes.

2. Within each household nonresponse weighting class, $E(\bar{Y}_{Rj}) = E(\bar{Y}_{\bar{R}j})$, where $\bar{Y}_{Rj}$ and
$\bar{Y}_{\bar{R}j}$ are the means for the sample respondents and nonrespondents, respectively, in the
j-th weighting class.

3. The weighting class means differ, that is, $E(\bar{Y}_{Rj}) \neq E(\bar{Y}_{Rj'})$ for $j \neq j'$.


(Assumptions analogous to 2 and 3 above are also implicit for within household nonresponse 
adjustment classes.)


Table 4

NCS Noninterview Rates - 1984
(rates in percent; ? = value illegible in the source)

                           Average
                            1984     Jan.     Feb.     Mar.     Apr.     May      June

Household Noninterviews
  Total Interviewed HH’s   11,769   11,916   11,925   11,743   11,809   11,918    9,482
  Total                       430      446      540      481      446      388      348
  Rate                        3.5      2.6      4.3        ?      3.6        ?      3.5
    No one at home            0.9      0.8        ?      0.9      0.9      0.7      1.0
    Temporarily Absent        0.6      0.6      0.6      0.8      0.6      0.4      0.7
    Refusal                   1.9      2.1      2.6        ?        ?      2.0      1.9
    Other                     0.1        ?      0.2      0.1        ?      0.1      0.1

Within Household Noninterviews
  Total                       685      655      751      701      806      804      697
  Rate                          ?      2.6      3.0      2.8      3.0        ?      3.2


                            July     Aug.     Sept.    Oct.     Nov.     Dec.

Household Noninterviews
  Total Interviewed HH’s    9,869    9,446    9,895    9,350    9,692    9,410
  Total                       411      409      337      406      387      346
  Rate                        4.0      4.2        ?      4.2      3.8      3.5
    No one at home            0.9      0.9      0.6      1.0        ?      1.0
    Temporarily Absent          ?      1.0      0.6      0.6      0.4      0.4
    Refusal                     ?      2.3      2.0      2.4        ?        ?
    Other                     0.1      0.1      0.1      0.3      0.3      0.1

Within Household Noninterviews
  Total                       709      678      666      728        ?      803
  Rate                          ?        ?        ?      3.4        ?        ?



The selection of weighting classes for this procedure is constrained by the requirement
that measurements for the weighting class variables (covariates) must be available (either
before or during the survey) for both the respondents and the nonrespondents. This
essentially restricts the characteristics by which classes are defined to those associated with
geography, race, urbanicity, housing unit characteristics, and design levels. The bias
reduction capability of the procedure depends, in part, on the extent to which the NCS nonresponse
weighting classes satisfy the three assumptions given above. No definitive results relating
to this concern are currently available, but relevant research is underway and more empirical
studies seem warranted.
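
The role of the second assumption can be made explicit with a standard textbook expression for the approximate bias of a weighting-class estimator of a mean. The expression is not taken from this paper. It assumes a deterministic response model, in which each population unit either always responds or never responds. The symbols are introduced here for illustration: $W_j$ is the population share of class $j$, $R_j$ is the proportion of respondents in class $j$, and $\bar{Y}_{Rj}$ and $\bar{Y}_{\bar{R}j}$ are the class means for respondents and nonrespondents.

```latex
\[
\operatorname{Bias}\bigl(\hat{\bar{Y}}\bigr) \;\approx\; \sum_{j=1}^{M} W_j\,(1 - R_j)\,\bigl(\bar{Y}_{Rj} - \bar{Y}_{\bar{R}j}\bigr)
\]
```

Under this model the bias vanishes when respondent and nonrespondent means agree within every class, and for a given difference in means it shrinks as the class response rates $R_j$ approach one.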


Survey Methodology, December 1986 165 


2.2 Alternatives to Sample Weighting 


There are a number of plausible alternatives to weighting to adjust for nonresponse. See, 
for example, Little (1986, Section 5). However, there are no definitive results which show 
that any of them offer appreciable advantages. Subsections 2.2.1 and 2.2.2 contain brief 
descriptions of two alternatives which are currently being investigated for application to 
demographic surveys. 


2.2.1 Separate Estimates for Dissimilar Types of Nonresponse 


In demographic surveys, nonrespondents can be placed into four categories: refusal (REF),
not-at-home (NAH), other occupied unit (OTO), or a unit from which a response was not
obtained due to extenuating circumstances. These are referred to as type A noninterviews.
The NAH group can be divided into those households or individuals whose extended absence
from their homes precludes an interview during the scheduled interview period (NAH₁), and
the group which is expected to return home sometime during the survey period (NAH₂).

The authors are not aware of any data which show that the four nonresponse groups are
generally similar. In fact, the Census Bureau’s Current Population Survey and the Canadian
Labour Force Survey suggest that the NAH₂ households are likely to be smaller, younger,
and have a larger proportion of employed people than the other groups. The NAH₁ group
is usually older with a relatively low employment rate. The interviewed group may be more
reflective of the REF and OTO groups. [See Palmer and Jones (1967) and Paul and Lawes
(1982).] It is conceivable that separate treatment of the four nonresponse groups could
produce a better overall adjustment for nonresponse than is obtained from the current procedure.
This option is being investigated by an NCS nonresponse adjustment research group.


2.2.2 Weighting With Response Probabilities 


Several weighting techniques have been advanced which make use of the concept of response
probabilities. Most of these techniques are based on concepts introduced by Politz and
Simmons (1949) which group sample respondents according to estimates of their probabilities
of responding. The factors with which the sample data in the resultant weighting groups are
inflated are the inverses of the estimated response probabilities. The Politz-Simmons
procedure has some serious limitations, such as its inapplicability to refusals. However, there
have been a number of fairly recent extensions and applications of the procedure, including
those presented by Anderson (1978) and Thomsen and Siring (1983). These methods may be
applicable to recurring surveys for which extensive callbacks are made.
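
The weighting idea can be illustrated with the Politz-Simmons procedure in the form in which it is usually described in textbooks, which is an assumption here since the paper does not spell it out. Each respondent reports on how many of the previous five evenings he or she was at home, the probability of being found at home is estimated as (k + 1)/6, and the respondent is weighted by the inverse. The function names and example data are hypothetical.

```python
def politz_simmons_weight(nights_home, nights_asked=5):
    """Inverse of the estimated probability of being found at home.

    nights_home: number of the previous `nights_asked` evenings on which the
        respondent reports having been at home (0 to nights_asked).
    The interview evening itself counts as one further evening at home.
    """
    if not 0 <= nights_home <= nights_asked:
        raise ValueError("nights_home must lie between 0 and nights_asked")
    return (nights_asked + 1) / (nights_home + 1)

def weighted_mean(values, nights):
    """Mean of `values`, weighting each respondent by the Politz-Simmons weight."""
    weights = [politz_simmons_weight(k) for k in nights]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Invented data: 1 = has the characteristic of interest, 0 = does not.
values = [1, 0, 1, 0]
nights = [5, 5, 2, 0]   # evenings at home out of the previous five

print(weighted_mean(values, nights))
```

Respondents who are rarely at home receive large weights because each one stands in for many similar people who were missed. This also shows why the method cannot compensate for refusals, since being at home more or less often does not change whether a person refuses.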

Research is in progress regarding the development of models which may be used to estimate
response probabilities for several demographic surveys for units with similar values of the
"independent variables." The feasibility and merits of computing nonresponse adjustment
factors, as well as constructing weighting classes based on such models (sometimes referred
to as response propensity stratification), are being examined. [See Rosenbaum and Rubin
(1983) and Little and Samuhel (1983).] Moreover, there are continued efforts to develop more
objective methods of sample weighting for nonresponse, which are designed to control
nonresponse-related errors.
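
Response propensity stratification can be sketched as follows, taking the estimated propensities as given. This is an illustration under simplifying assumptions: the propensities would in practice come from a fitted response model, which is not shown; sampling weights are ignored; and the function names and data are hypothetical.

```python
def propensity_classes(units, n_classes):
    """Assign each unit to one of n_classes groups of near-equal size,
    ordered by estimated response propensity.
    units: list of (propensity, responded) pairs."""
    order = sorted(range(len(units)), key=lambda i: units[i][0])
    labels = [0] * len(units)
    for rank, i in enumerate(order):
        labels[i] = rank * n_classes // len(units)
    return labels

def class_factors(units, labels):
    """Adjustment factor per class: inverse of the observed response rate.
    Assumes every class contains at least one respondent."""
    counts, resp = {}, {}
    for (_, responded), c in zip(units, labels):
        counts[c] = counts.get(c, 0) + 1
        resp[c] = resp.get(c, 0) + (1 if responded else 0)
    return {c: counts[c] / resp[c] for c in counts}

# Invented estimated propensities and response outcomes.
units = [
    (0.55, True), (0.60, False), (0.62, False), (0.70, True),
    (0.90, True), (0.92, True), (0.95, False), (0.97, True),
]
labels = propensity_classes(units, n_classes=2)
factors = class_factors(units, labels)
print(labels, factors)
```

Grouping units by propensity and using the class response rate, rather than inverting each individual estimate, keeps a few very small estimated probabilities from producing extreme weights.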


3. THE SURVEY OF INCOME AND PROGRAM PARTICIPATION 


The Survey of Income and Program Participation (SIPP) is a new, ongoing national
household survey program of the U.S. Bureau of the Census. The purpose of SIPP is to
improve the measurement of information related to the economic situation of households
and persons in the United States. It is the culmination of a large-scale development program,
the Income Survey Development Program (ISDP), which examined concepts, procedures,
questionnaires, recall periods, and the like. For a description of the ISDP, see Ycas and
Lininger (1981). Data from SIPP are expected to be useful in studying the Federal transfer system,
estimating program costs under changes in program eligibility rules, evaluating the effects
of program changes on selected population subgroups, as well as studying changes to the
tax system.

In October 1983 SIPP began as an ongoing survey program with one sample panel of 
approximately 21,000 occupied households eligible for interview in 174 Primary Sample Units 
(PSU’s) selected to represent the noninstitutional population of the United States. (Beginning 
in 1985 a new panel is being introduced in February of each year; the 1985 panel consisted 
of 14,500 households eligible for interview.) 

Each household is interviewed once every four months for approximately 2½ years to
produce sufficient data for longitudinal analysis while providing a relatively short recall period
for reporting monthly income. The reference period for the principal survey items is the 4
months preceding the interview. This design provides eight interviews per household, and
allows cross-sectional estimates to be produced from more than one panel.

To facilitate field and processing operations, each sample panel is divided into four
approximately equal subsamples, called rotation groups; one rotation group is interviewed in
a given month. Thus, one cycle or "wave" of interviewing, using the same questionnaire,
takes four consecutive months. Cumulative household noninterview rates are given in Table
5 for the 1984 SIPP panel.

At the time of the interviewer’s visit, each person 15 years old or older who is present
is asked to provide information about himself/herself; a proxy respondent is asked to
provide information for those who are not available. An important design feature of SIPP is
that all persons in a sample household at the time of the first interview remain in the sample
even if they move to a new address during the next 2½ years. For cost and operational
reasons, in-person interviews are only conducted at new addresses that are within 100 miles
of a SIPP primary sampling unit. The geographic areas defined by these rules contain over
96% of the U.S. population. An attempt is made to conduct a telephone interview with those
moving outside the 100-mile limit.


Table 5

Cumulative Household Noninterview Rates
for the 1984 SIPP Panel

Wave    Cumulative Noninterview Rate

  1           4.9%
  2           9.4%
  3          12.3%
  4          15.4%
  5          17.4%
  6          19.4%
  7          21.0%
  8          22.0%
  9          22.3%


After the first interview, the SIPP sample is a person-based sample, consisting of all
individuals who were living in the sample unit at the time of the first interview. Individuals
aged 15 and over who subsequently share living quarters with original sample people are
also interviewed in order to provide the overall economic context of the original sample
persons.

More detailed information concerning the SIPP design, content, and operations can be 
found in Nelson, McMillen, and Kasprzyk (1985). 


3.1 Nonresponse Adjustments in SIPP 


Data collected in SIPP can be viewed from two perspectives: cross-sectional or longitudinal. 
From the former point of view, each SIPP interview is treated as a separate cross-sectional 
survey, providing point-in-time estimates. For examples of these estimates, see U.S.
Department of Commerce, Bureau of the Census (1984a). From the longitudinal point of view,
data are collected at more than one point-in-time, and the survey record is viewed not as 
a set of unrelated observations, but as a set of variables with logical dependency between 
two or more points-in-time. Data processing operations, as well as statistical estimation, are 
treated from this point of view, and therefore, rely on the use of data collected at two or 
more interviews. 

Since SIPP can be viewed from both the longitudinal and cross-sectional perspectives, 
SIPP’s public-use microdata files include cross-sectional data files issued on a
wave-by-wave basis as well as longitudinal files. This implies two distinct systems to treat survey
nonresponse. 


3.1.1 Cross-Sectional Unit Nonresponse Adjustments 


The cross-sectional unit nonresponse adjustment in SIPP is similar to the way
noninterview adjustments are made in other Census Bureau recurring surveys. The following variables
were used to define household noninterview adjustment cells for the first interview wave of
SIPP. See U.S. Department of Commerce, Bureau of the Census (1983 and 1984b).


1. Census Region - Northeast, Midwest, South, West.

2. Residence - Standard Metropolitan Statistical Area (SMSA), non-SMSA.

3. Place/not place - defined for units not in an SMSA;
Central city/balance - defined for units in SMSA’s.

4. Race of reference person - black, non-black.

5. Tenure - owner of home, renter.

6. Household size - 1, 2, 3, 4 or more.

7. Rotation group - 1, 2, 3, 4.


Two criteria must be met by each weighting class: (1) the weighting class must contain 
at least 30 unweighted units and (2) the noninterview adjustment factor for a weighting 
class must be less than or equal to 2.0. For a given rotation group, the collapsing 
procedure to satisfy these two criteria is applied independently for each of the four tenure 
by race combinations. (For the first wave, there was no within-household nonresponse 
adjustment factor.) 
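
The two criteria can be turned into a simple collapsing rule. The sketch below is a hypothetical illustration, not the SIPP collapsing procedure, whose merge order is not described here. It assumes the cells are already arranged so that neighbouring cells are similar, merges a failing cell into its next neighbour (or the previous one for the last cell), and computes the adjustment factor from unweighted counts for simplicity.

```python
def collapse_cells(cells, min_units=30, max_factor=2.0):
    """Merge adjacent cells until every cell has at least `min_units` sample
    units and an adjustment factor (sample units / respondents) of at most
    `max_factor`.

    cells: ordered list of (n_sample, n_respondents) pairs.
    Returns the collapsed list. If the cells collapse to a single cell, it is
    returned even if it still fails a criterion.
    """
    cells = [list(c) for c in cells]

    def fails(cell):
        n, r = cell
        return n < min_units or r == 0 or n / r > max_factor

    i = 0
    while i < len(cells) and len(cells) > 1:
        if fails(cells[i]):
            j = i + 1 if i + 1 < len(cells) else i - 1   # neighbour that absorbs it
            cells[j][0] += cells[i][0]
            cells[j][1] += cells[i][1]
            del cells[i]
            i = 0                                        # re-check from the start
        else:
            i += 1
    return [tuple(c) for c in cells]

# Invented cells: the second one has fewer than 30 sample units.
cells = [(120, 110), (25, 20), (40, 18), (200, 190)]
print(collapse_cells(cells))
```

In this example the small second cell is merged into the third, and the combined cell then satisfies both criteria even though the third cell alone would have had a factor above 2.0.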

In subsequent waves of SIPP, the household nonresponse adjustment factor accounts for
noninterviews associated with units which have moved and cannot be located, or have moved
more than 100 miles from a SIPP PSU and cannot be contacted by telephone, as well as units
which are refusals, etc. Adjustments are performed for each month of the reference period,
as well as the interview month, to account for an increase in the number of noninterviews
caused by splits of sample households. The procedure is similar to that described for
determining the Wave 1 household nonresponse adjustment factor; however, the variables
used to define the weighting classes differ. Those variables are:


1. Race (white, nonwhite) and Spanish-origin (Spanish, non-Spanish) of reference person: 
a) reference person is white and not Spanish, and b) others. 


2. Household type - three categories: a) female householder, no husband present, with 
own children under 16, b) householder’s age is sixty-five years or older, and c) others. 


3. Education level of the reference person: a) less than 8 years, b) 8-11 years, c) 12-15 
years, and d) 16 or more years. 

4. Type of income received (using the most recently completed interview for members
of the household) - two categories: a) households which received at least one of the
following sources of income - Supplemental Security Income; Black Lung Payments;
Aid to Families with Dependent Children; General Assistance, Indian, Cuban, or
Refugee Assistance; foster child care payment; Women’s, Infants’, and Children’s
Nutrition program; Food Stamps; and Medicaid; and b) others.


5. Assets - two categories: a) households in which at least one member held an asset type
other than a savings account or an interest-bearing checking account, and b) all others.


6. Tenure: a) owner of home and b) renter.


7. Public housing or rent subsidies - renters are identified as a) those living in public housing
projects or receiving rent subsidies from the government; and b) those not living in
public housing projects and not receiving rent subsidies from the government.


8. Household size: 1, 2, 3, 4 or more. 


The variables used for household nonresponse adjustments for the second and subsequent 
SIPP interviews differ from the first wave variables because of additional data available 
after the first interview for use in nonresponse procedures for later interviews. Fifty-three 
weighting classes were created using these variables with tenure as the principal variable for 
partitioning the sample. [For a description of these weighting classes see U.S. Department 
of Commerce, Bureau of the Census (1984c).] Although a cell collapsing strategy has been 
defined which merges cases in cells exhibiting similar poverty-related characteristics, little 
collapsing takes place since the nonresponse adjustment factors are calculated for three 
rotation groups (the SIPP data processing cycle) rather than one rotation group, as in the 
first interview. 

There is a within-household nonresponse compensation procedure for the second and subse- 
quent waves. This procedure is to ‘‘hot deck’’ (i.e., duplicate) the entire record of a sample 
respondent who presumably has survey characteristics that are similar to those of the 
nonrespondent. 


3.1.2 Longitudinal Nonresponse Adjustments 


Since persons identified as living at the sample address at the time of the first interview 
constitute the SIPP sample for waves subsequent to the first, the most useful and logical 
way of describing the nature of the SIPP nonresponse problem from the longitudinal view- 
point is in terms of individuals or persons. Each individual’s microdata record is an extended 
record containing variables which oftentimes reflect the same measure at different points 
in time. Thus, in a panel survey of n waves there exist 2^n possible noninterview patterns for 
a sample person. Noninterview patterns of the original sample persons for the first five in- 
terviews (waves) of the 1984 panel are given in Table 6, adapted from Kalton, McMillen, 
and Kasprzyk (1986). 

Table 6 

Interview Patterns of the Original Sample Persons for the First Five Interviews 
of the 1984 SIPP Panel 

Response Pattern                                                   Percent

Response every interview (5 interviews)
    Pattern:  XXXXX                                                   79.1
Apparent attrition cases                                              13.8
    Patterns: XXXXO                                                    3.8
              XXXOO                                                    3.1
              XXOOO                                                    3.2
              XOOOO                                                    3.7
First and fifth interviews conducted, but one or more intervening
interviews missing                                                     4.1
    Patterns: XXXOX                                                    1.6
              XOXXX                                                    0.6
              XXOXX                                                    1.2
              XXOOX                                                    0.1
              XOXOX                                                    0.1
              XOOOX                                                    0.3
              XOOXX                                                    0.2
Fifth interview missing and one or more intervening interviews
missing                                                                0.7
    Patterns: XOXXO, XOXOO, XOOXO, XXOXO
Left the universe (deceased, institutionalized, living in armed
forces barracks, moved overseas)                                       2.3
Total                                                                100.0
                                                                  (25,128)
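
The grouping of patterns used in Table 6 can be reproduced mechanically. The Python sketch below is only an illustration: the function names and category labels are hypothetical, and it assumes, as in Table 6, that every original sample person was interviewed at the first wave, so that 2^(n-1) of the 2^n patterns can occur. The "left the universe" category is not derivable from the pattern alone and is ignored here.

```python
from itertools import product


def classify(pattern: str) -> str:
    """Classify an interview pattern such as 'XXOXX' (X = interviewed, O = missed)."""
    if "O" not in pattern:
        return "complete"
    # Apparent attrition: after the first missed interview, nothing is ever conducted again.
    if "X" not in pattern[pattern.index("O"):]:
        return "attrition"
    if pattern[-1] == "X":
        return "gap, last wave conducted"
    return "gap, last wave missing"


def patterns_for_sample(n_waves: int) -> list:
    """All patterns that begin with a completed first interview (assumption, see lead-in)."""
    return ["X" + "".join(rest) for rest in product("XO", repeat=n_waves - 1)]


if __name__ == "__main__":
    for p in patterns_for_sample(5):
        print(p, classify(p))
```

For five waves the sketch yields one complete pattern, four attrition patterns, seven patterns with the fifth interview conducted, and four with the fifth interview missing, which matches the groups listed in Table 6.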


The first SIPP longitudinal microdata file will contain twelve months (three interviews) 
of data from the 1984 SIPP panel, with the individual as the principal analytic unit. The 
sample of cases to be weighted for this file will be only those persons with three completed 
interviews. Those sample persons with only one or two interviews will be treated as 
nonrespondents. Their reported data will help to define nonresponse adjustment classes. 

Since the first microdata longitudinal file contains only persons responding to all three 
interviews, the nonresponse adjustment issue is virtually the same as for the cross-section 
case. There are, however, two nonresponse adjustment factors applied to the initial sampling 
weights. See Kobilarcik and Singh (1986). The first adjustment factor accounts for households 
classified as noninterviews in the first interview wave. The second factor accounts for persons 
who did not supply all three interviews. 

For the first adjustment factor, only those household variables available at the first 
interview can be used. Adjustment factors are calculated separately within cells defined by 
the following variables: 


a. Census Region 
b. Residence (metropolitan, non-metropolitan) 
c. Race of reference person 
d. Tenure (own, rent) 
e. Household size 

The second set of adjustment factors is implemented on a person basis. The factors are 
calculated within cells defined by the following characteristics: 


a. Monthly household income 
b. Program participation status of the person’s household 
c. Labor force status 
d. Race 
e. Years of school completed 
f. Type of assets of person’s household 


Cells are collapsed whenever they contain fewer than thirty sample persons or the nonresponse 
adjustment factor exceeds 2. 
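
The collapsing rule can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SIPP production procedure: the factor in a cell is taken to be the weighted count of all sample persons divided by the weighted count of respondents, and a failing cell is merged with the next cell in a pre-ordered list (or the previous one if it is last). The function name, the dictionary keys, and the merging order are hypothetical; the text does not specify the order in which SIPP cells are collapsed.

```python
def adjustment_factors(cells, min_count=30, max_factor=2.0):
    """Return {cell name: nonresponse adjustment factor} after collapsing.

    cells: ordered list of dicts with keys
        'name'   - cell label
        'n'      - number of sample persons in the cell
        'w_all'  - sum of weights, respondents plus nonrespondents
        'w_resp' - sum of weights, respondents only
    """
    groups = [dict(names=[c["name"]], n=c["n"], w_all=c["w_all"], w_resp=c["w_resp"])
              for c in cells]

    def needs_collapse(g):
        return (g["n"] < min_count or g["w_resp"] == 0
                or g["w_all"] / g["w_resp"] > max_factor)

    i = 0
    while i < len(groups) and len(groups) > 1:
        if needs_collapse(groups[i]):
            j = i + 1 if i + 1 < len(groups) else i - 1   # neighbour to merge with
            lo, hi = min(i, j), max(i, j)
            a, b = groups[lo], groups[hi]
            groups[lo:hi + 1] = [dict(names=a["names"] + b["names"],
                                      n=a["n"] + b["n"],
                                      w_all=a["w_all"] + b["w_all"],
                                      w_resp=a["w_resp"] + b["w_resp"])]
            i = lo                                        # re-check the merged cell
        else:
            i += 1
    return {name: g["w_all"] / g["w_resp"] for g in groups for name in g["names"]}
```

Every person in a collapsed group receives the same factor, so the respondents' adjusted weights sum to the weighted total of the whole group.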

As the survey progresses, more sophisticated methods of adjusting for longitudinal 
nonresponse will be developed which make use of the data provided for partial respondents 
(i.e., for sample persons that provide some, but not all, of the interview waves requested). 
It is not obvious how to treat the partial response cases. Data gaps associated with persons 
who miss one or more interviews can be viewed as either person nonresponse, and typically 
handled by weighting adjustments, or as item nonresponse, usually handled by some type 
of imputation method. For example, one might consider an individual with an (R,NR,R) 
pattern as a case of item nonresponse since the missing interview is bounded on both sides by 
completed interviews; but one might consider an individual with an (NR,R,NR) pattern as 
total unit nonresponse, treating it the same as (NR,NR,NR). However, we need to recognize 
that even in the case of the response pattern (R,NR,R) for an individual, four kinds of response 
patterns are still possible at the item level. Thus, many options can be considered when 
developing nonresponse compensation procedures for the SIPP longitudinal data base. This 
issue is discussed by Kalton (1986) and by Kalton, Lepkowski, and Lin (1985). 


3.2 SIPP Research Activities 


There are two areas where work has recently begun which should aid future decisions con- 
cerning nonresponse adjustments. First, the SIPP questionnaire, beginning during the fourth 
interview, contains a ‘‘Missing Wave’’ section. This section uses a short series of questions 
on labor force participation, income sources, and asset ownership/nonownership for 
respondents in the current wave who did not respond in the preceding wave. Respondents 
who miss two or more consecutive interviews are not eligible to complete the ‘‘Missing Wave’’ 
section. By emphasizing data collection at the expense of minor reporting burden, the per- 
son nonresponse problem can be reduced to an item nonresponse problem. An evaluation 
of the quality of the retrospective data will be necessary prior to using these data. 

The second area of work concerns general strategies in the treatment of person-wave 
nonresponse in the SIPP. Graham Kalton and his colleagues at the Survey Research Center 
will (1) compare longitudinal imputation and weighting strategies for handling person-wave 
nonresponse, (2) evaluate imputation and weighting models in terms of the analysis of change 
across waves and aggregation across waves, and (3) develop preliminary criteria for the choice 
of method for treating person-wave nonresponse. A discussion of these and other issues which 
will be studied can be found in Kalton (1986), and Kalton and Miller (1986). 

Finally, there are several other research topics for which work is planned. These include: 
(1) quantifying the selection of variables used for determining weighting classes; (2) assessing 
the robustness of the survey estimates on the population and selected subgroups under dif- 
ferent nonresponse compensation procedures, and different weighting class cell collapsing 
strategies; (3) investigating the potential for making separate nonresponse adjustments by 
type of noninterview; (4) investigating the effect of deleting reported survey data to simplify 
the nature of the SIPP missing data problem; and (5) evaluating the longitudinal nonresponse 
compensation procedures adopted for the first SIPP longitudinal research file. 


4. UNIT NONRESPONSE PROCEDURES FOR ECONOMIC 
CENSUSES AND SURVEYS 


The Bureau of the Census carries out six economic censuses every five years, the most 
recent ones covering 1982. These six economic censuses are identified by the following 
trade areas: 


(1) Retail Trade 

(2) Wholesale Trade 
(3) Service Industries 
(4) Manufactures 

(5) Mineral Industries 
(6) Construction 


In addition to the economic censuses, the Census Bureau carries out the Census of Govern- 
ments and the Census of Agriculture. Though not part of the economic censuses, they are 
conducted during the same years as the economic censuses for processing efficiencies and to 
allow for data linkage. In nearly all of these economic areas the Census Bureau also carries 
out a number of monthly, quarterly, and annual surveys. 

As in the demographic areas, there is some unit nonresponse for all of the economic censuses 
and surveys. In most cases, missing data are imputed based on (a) previous responses provided 
by the nonrespondent, (b) data from administrative records, and (c) relationships established 
between various data items. Rather than reporting the percent of units not responding, the level 
of nonresponse for an economic census or survey is usually given as the percent of one or more 
item totals that are imputed. These percents will be referred to as imputation rates. 

Explanations of the unit nonresponse methods used for five of the six economic censuses 
are given in Section 4.1. Section 4.2 addresses unit nonresponse procedures for three economic 
surveys, and Section 4.3 covers such procedures for the Census of Agriculture. Research 
and evaluation activities with regard to nonresponse procedures for economic censuses and 
surveys are discussed in Section 4.4. 

More detailed explanations of the nonresponse procedures used in these censuses and several 
related surveys are given by Bailey, Chapman and Kasprzyk (1985). 


4.1 The Economic Censuses 


The frame for the economic censuses is the Standard Statistical Establishment List (SSEL), 
a computer file maintained by the Census Bureau. The SSEL is comprised of all employer 
establishments reported by multi-unit employer companies in the Census Bureau’s Company 
Organization Survey (COS) and all single-unit companies that filed a tax form with IRS. The 
COS is an annual survey of multi-unit companies. Companies that have at least 50 employees 
are surveyed each year, while companies with fewer than 50 employees are surveyed every 
three years. Each company in the COS is sent a list of the establishments it reported most 
recently in the survey and asked to update the list. They are also asked to provide, for each 
establishment, employee counts for the first quarter of the previous year and total payroll 
for the previous year. For the economic censuses, each establishment on the SSEL, except 
small single-unit establishments, is sent a census questionnaire (via its company) designed 
for its standard industrial classification (SIC) code. 

Although there are many similarities among the unit nonresponse procedures used 
in the six trade areas, some important differences exist. In the following description 
of the unit nonresponse adjustment procedures used for five of the economic censuses, 
the trade areas that use essentially the same procedure will be grouped together 
as follows: 


(a) Retail trade, wholesale trade, services 
(b) Manufactures, mineral industries 


4.1.1 Retail Trade, Wholesale Trade, Service Industries 


These three parts of the economic censuses are often referred to collectively as the 
business census. For these trade areas, data for the census year are collected on 
sales/receipts, employment, and payroll. The imputation rate for sales/receipts varies from 
10 to 15 percent for retail and wholesale trade and is about 20 percent for service 
industries. 

For any establishment that does not provide the census data, responses are generally im- 
puted using tax form information available from the Internal Revenue Service (IRS). For 
payroll information, the IRS has four quarters of data available for each employer 
identification (EI) number from tax forms. A company may have one or more EI numbers. Payroll 
data for a particular company are obtained by adding up the payroll figures for all EI numbers 
used by the company. First quarter employment counts are also available by EI number from 
IRS records and can be aggregated to the company level. For sales/receipts, various IRS 
forms are used depending on whether the nonresponding company is a sole proprietorship, 
partnership, or corporation. 

The imputation procedure is complicated by the difference between the census enumeration 
unit and the IRS tax unit. For the business census, the unit of enumeration is the 
establishment (i.e., a single location). However, the tax unit for the IRS is an EI number. There may 
be one or more establishments reporting under the same EI number. If a nonresponding 
company has only one location (i.e., is a single-unit company), then it will have only one 
EI number and imputation is straightforward. However, for a multi-unit nonresponding 
company imputation is more complex since, in general, IRS data will not be available for 
each establishment. In such a case, the company structure is determined first by referring 
to the SSEL to obtain a list of all establishments contained in a company and all EI numbers 
used by the company. The total for a company for each data item is obtained by adding 
the item across all EI numbers used by the company, as discussed above. The company total 
is distributed to establishments by prorating the total based on the most recent data available 
for the company from an annual or monthly survey. If no data are available, an equal 
proration is used. If there is nonresponse for only a portion of the establishments in a multi- 
unit company, data for the nonresponding establishment are imputed based on prior 
year relationships. 
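
The proration step for a fully nonresponding multi-unit company can be illustrated with a short Python sketch. The function name and data layout are hypothetical. The text states only that an equal proration is used when no survey data are available; falling back to equal shares when data are missing for some, but not all, establishments is an assumption made here for simplicity.

```python
def prorate(company_total, prior_values):
    """Distribute an IRS-based company total across establishments.

    prior_values maps establishment id -> most recent survey value for the
    same item, or None when no survey value is available.
    """
    ids = list(prior_values)
    values = list(prior_values.values())
    base = sum(v for v in values if v is not None)
    if any(v is None for v in values) or base == 0:
        # No usable survey data: equal proration.
        share = company_total / len(ids)
        return {e: share for e in ids}
    # Otherwise each establishment receives its share of the prior survey total.
    return {e: company_total * prior_values[e] / base for e in ids}
```

For example, a company total of 900 with prior survey values of 100 and 200 for its two establishments would be split 300 and 600; with no prior values, three establishments would each receive 300.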


4.1.2 Manufactures, Mineral Industries 


In these two economic censuses, general information is obtained on the number 
of employees, hours worked, and on production levels by four-digit standard industrial 
classification (SIC) codes. Imputation rates vary from about 10 to 15 percent. The unit 
nonresponse procedure used depends on the type of company that did not respond (i.e., single-unit 
or multi-unit) and on whether or not a previous year’s record is available. Thus, there 
are four types of nonresponse cases that occur. The method of treating nonresponse for these 
four cases follows: 


(1) Single-unit company, previous year data are available from the Annual Survey of 
Manufactures. 


In this case annual payroll is obtained from IRS tax forms and compared to the previous 
payroll total reported. The percent change from the previous period is determined. 
This percent change is applied to all data items in the previous record to obtain an 
imputed current record, except for employment and value of shipments whenever these 
are available from IRS. 


(2) Single-unit company, no previous year data are available. 


In this case, sets of ratios are developed between census items within each four-digit 
SIC, with payroll as the ‘‘seed.’’ That is, the relationships are developed in such a 
way that all items can be imputed from these relationships either directly or indirectly 
if a payroll figure is obtained. The specific relationships are derived from historic data 
reported by the respondents in the same industry. Then the (seed) value of payroll 
is obtained from IRS tax records and all other items are imputed from the relation- 
ships derived. 


(3) Establishment in a multi-unit company, previous year data are available for the 
establishment. 


First, for each four-digit SIC, an aggregate growth factor between the previous and 
current period is developed from external sources for each of the following key items: 
payroll, employment, change in inventory, and change in capital expenditures. These 
four growth factors are applied to the appropriate prior year data items for each 
establishment to obtain imputed responses for the current period. These four imputed 
items are then used as ‘‘seeds’’ to impute other items. 


(4) Establishment in a multi-unit company, no previous year data are available for the 
establishment. 


In this case, basic data on payroll and employment are obtained for each establishment 
from the SSEL discussed earlier in Section 4.1. As indicated, the SSEL contains 
data on employment and payroll for all establishments included in the COS. 
Then, using the SSEL data as a base, the data record for each establishment is imputed 
from relationships developed between the SSEL data items and the other census items. 
This procedure is analogous to that used in case (2) above. 
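
Cases (1) and (2) can be sketched in Python as follows. The sketch is illustrative only and the function names are hypothetical. For case (2) it assumes that each relationship is a simple ratio of an item total to the payroll total among respondents in the same four-digit SIC; the actual relationships may be chained through intermediate items, which is not reproduced here.

```python
def impute_from_prior(prior_record, prior_payroll, current_payroll, irs_items=None):
    """Case (1): scale each item of the prior-year record by the payroll change.

    irs_items holds items (e.g. employment, value of shipments) taken directly
    from IRS records; these replace the scaled values.
    """
    growth = current_payroll / prior_payroll
    imputed = {item: value * growth for item, value in prior_record.items()}
    imputed.update(irs_items or {})
    return imputed


def seed_ratios(respondent_records, seed="payroll"):
    """Case (2): ratio of each item total to the seed total within one SIC."""
    totals = {}
    for rec in respondent_records:
        for item, value in rec.items():
            totals[item] = totals.get(item, 0.0) + value
    return {item: totals[item] / totals[seed] for item in totals if item != seed}


def impute_from_seed(seed_value, ratios, seed="payroll"):
    """Build a full record from the IRS payroll figure and the derived ratios."""
    record = {item: seed_value * r for item, r in ratios.items()}
    record[seed] = seed_value
    return record
```

Cases (3) and (4) follow the same pattern, with several seed items and with SSEL payroll and employment, respectively, in place of the single IRS payroll figure.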


4.2 Economic Surveys 


The Census Bureau conducts a large number of monthly, quarterly, and annual economic 
surveys in addition to the economic censuses. In particular, most of the six census trade 
areas have monthly or annual surveys. The unit nonresponse procedures used for the 
Monthly Retail Trade Survey and the Truck Inventory and Use Survey are described below. 
The unit nonresponse adjustment procedure used for the Annual Survey of Manufactures 
(ASM) is not described here since it is virtually the same as that used for the Census of 
Manufactures, described in Section 4.1.2. Imputation rates for the ASM vary from 5 to 
10 percent. 


4.2.1 Monthly Retail Trade 


The Monthly Retail Trade Survey includes about 30,000 reporting units: about 3,000 
selected with certainty and 27,000 selected on a probability basis. The certainty cases are 
surveyed each month, while a third of the noncertainty cases are surveyed each month. This 
provides a monthly mailing of about 12,000 reporting units. For a multi-unit company in the 
survey, a subsample of the establishments in the company is selected for inclusion. Monthly 
retail sales is the only item enumerated in the survey. The imputation rate for retail sales 
is about 11 percent. 

If a single-unit certainty company or a sample establishment in a multi-unit certainty 
company does not report for a given month, a value for sales is imputed from the previous 
month’s figure by multiplying it by a ‘‘ratio of identicals.’’ This adjustment ratio is derived 
by dividing the weighted sum of the current month sales by the weighted sum of the previous 
month sales for all establishments in the same adjustment cell for which sales were reported 
for both the current and previous months. Adjustment cells are generally defined by the first 
three digits (or four digits in a few cases) of the SIC code, by type of establishment (i.e., 
whether or not it belongs to a large multi-unit firm), and by sales size class. The weight used 
for each reporting unit used in computing the ratio of identicals is the inverse of the pro- 
bability of selection of the reporting unit. 
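
The ratio of identicals for one adjustment cell can be written out as a short Python sketch. The function name and the tuple layout are hypothetical; the computation follows the description above, with each unit weighted by the inverse of its selection probability and only units reporting in both months entering the ratio.

```python
def ratio_of_identicals(units):
    """Weighted current-to-previous sales ratio among units reporting both months.

    units: iterable of (weight, previous_sales, current_sales); a sales value
    of None means the unit did not report for that month.
    """
    num = den = 0.0
    for weight, previous, current in units:
        if previous is not None and current is not None:
            num += weight * current
            den += weight * previous
    return num / den


# One adjustment cell: two identicals and one unit missing the current month.
cell = [(1.0, 100.0, 110.0), (5.0, 20.0, 21.0), (5.0, 40.0, None)]
ratio = ratio_of_identicals(cell)   # (1*110 + 5*21) / (1*100 + 5*20) = 215 / 200
imputed_current = 40.0 * ratio      # previous month's figure times the ratio
```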

If a multi-unit certainty company does not report sales for any of its establishments, the 
sales values are imputed for each establishment and for the entire company as in the previous 
case: applying the ratio of identicals for the appropriate adjustment cell to the previous month 
sales figures. If such a company does report current monthly sales for the entire company, 
the imputed establishment responses are ratio adjusted to be consistent with the reported 
total for the entire company. 

For noncertainty companies, imputation for missing sales data is carried out in a way 
similar to that used for certainty cases, except that an extra step is required since noncertainty 
companies report every three months. The first step is to impute the previous month’s sales 
for a nonrespondent based on the response provided three months ago. This is done by 
multiplying the sales reported three months ago by a ratio of identicals based on the weighted 
sum of sales during the previous month and the weighted sum of sales three months ago 
(cell by cell). Once the previous month sales are imputed, the current month sales is generated 
from the imputed value for the previous month using the same method described for 
certainty cases. 

If a nonrespondent is in the survey for the first time, the previous month’s sales (if it’s 
a certainty case) or the sales figure three months earlier (if it’s a noncertainty case) is imputed 
from the sales reported in the most recent census, if available. If the nonrespondent was 
not in the most recent census, then it would be a birth case for which two months of sales 
data generally would have been provided at the time the company was added to the frame. 
These data would be seasonally adjusted and then inflated to an annual-based figure. The 
imputation would then be carried out as though a census sales figure had been available for 
the nonrespondent. 


4.2.2 Truck Inventory and Use Survey (TIUS) 


The TIUS is conducted every five years and provides data on the physical and operational 
characteristics of trucks nationwide. These characteristics include type of trailer (vehicle con- 
figuration), kinds of products carried, type of gasoline used, and annual miles driven. The 
universe for the survey consists of the truck registrations from all 50 states and the District 
of Columbia. The sample size is about 120,000 truck registrations. About 75 percent of the 
trucks selected for the survey respond. 

Adjustments for unit nonresponse are made by ‘‘weighting up’’ the respondents to 
the total sample, separately within weighting classes. The weighting classes are taken 
to be the sample strata which consist of cross-classifications by state and body type 
(5 categories). The nonresponse weight adjustment is based on the number of trucks; within 
each class (stratum), the initial weight of each respondent is multiplied by the ratio of 
the number of trucks in the stratum to the sum of the initial weights of the respondents 
in the stratum. 
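
As a minimal Python illustration of this adjustment (the function name is hypothetical), the respondents' initial weights in a stratum are scaled so that they sum to the number of trucks in that stratum:

```python
def weight_up(stratum_truck_count, respondent_weights):
    """Scale respondents' initial weights so they sum to the stratum truck count."""
    factor = stratum_truck_count / sum(respondent_weights)
    return [w * factor for w in respondent_weights]


# A stratum of 1,000 trucks, 100 sampled at weight 10, of which 75 responded.
adjusted = weight_up(1000, [10.0] * 75)
```

After the adjustment the 75 respondents carry the full stratum total, which is the sense in which they are weighted up to the total sample.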

Of the economic surveys investigated, the TIUS is the only one that uses a weight 
adjustment procedure to account for unit nonresponse. With other economic surveys, 
alternate sources of basic information are generally available to ‘‘build’’ a record for 
a nonrespondent. 


4.3 Census of Agriculture 


The census of agriculture provides data relating to the Nation’s farming, ranching, 
and related activities. It is the leading source of agricultural statistics and the only 
source of consistent, comparable data about agriculture at the county, State, and 
national levels. 

The task of nonresponse adjustment for the census of agriculture is made complex by 
the fact that the SSEL cannot be as effectively used as it is in the other economic areas. 
The agricultural census mailing list is constructed by combining several overlapping sources. 
The resultant frame may contain some duplication and always contains some nonfarm en- 
tities. Thus, the nonresponse methodology must first identify, or estimate, the extent to which 
an adjustment is needed before it can take place. 

For the 1982 census, nonrespondents were designated as large or small based on whether 
their expected sales were above or below $100,000. A 100% telephone follow-up was 
conducted for all of the large nonrespondents. The small nonrespondents were then stratified 
based on other mail list characteristics. A sample of these units was followed up by mail 
and telephone to obtain estimates, by strata within States, of the percent of nonrespondents 
which were actually farms. These estimates were then used, along with data on in-scope 
percents of respondents by county, to make estimates of the number of nonrespondent farms 
at the county level for each stratum. The weights of a randomly selected sample of respondents 
by county, consistent with the estimated number of nonresponding farms, were then inflated 
by two. All other respondents retained their weight of one. 
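
The final weighting step can be sketched in Python as below. Only this last step is illustrated: the estimate of the number of nonrespondent farms for a county and stratum is taken as given, since the estimator is not spelled out here. The function name, the rounding of the estimate, and the fixed random seed are assumptions made for the sketch.

```python
import random


def assign_weights(respondent_ids, est_nonrespondent_farms, seed=0):
    """Give weight 2 to a random subset of respondents and weight 1 to the rest.

    The subset size equals the (rounded) estimated number of nonresponding
    farms, capped at the number of respondents available.
    """
    ids = list(respondent_ids)
    k = min(round(est_nonrespondent_farms), len(ids))
    rng = random.Random(seed)
    doubled = set(rng.sample(ids, k))
    return {r: 2 if r in doubled else 1 for r in ids}
```

The weights then sum to the number of respondents plus the estimated number of nonresponding farms, so county totals reflect both groups.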


4.4 Research Activities for Nonresponse Adjustments in Economic Surveys 


Probably the most important source of information for unit nonresponse imputation in 
economic surveys is IRS data from tax forms. Some differences between the IRS figures and 
those collected in the economic census may arise because of differences in definitions, forms, 
or the data collection procedures used. A study by Dyke (1984) compared administrative 
(IRS) data used to impute sales/receipts, payroll, and employment in the 1977 business 
census with corresponding responses obtained in a follow-up sample of nonrespondents. In 
general, he found that the survey values reported in the follow-up survey exceeded those 
obtained from administrative sources. The sizes of the differences varied by item. Also, the 
differences were more pronounced for multi-unit establishments. Additional comparisons 
of this type are needed. If systematic differences are identified, adjustment factors to apply 
to IRS figures may be developed. 

For several of the censuses and surveys, a ‘‘ratio of identicals’’ is calculated and 
used to obtain a factor to apply to a previous-period figure to obtain an imputed value 
for the current period. It is possible that this ratio computed among all sample cases 
that reported in both periods may not apply very well to the nonrespondents for some 
items. Bailey (1986) looked at alternatives to using ratios of identicals for imputing missing 
values such as linear regression and quadratic regression, using various sets of independent 
variables. 

With many of the economic unit nonresponse imputation methods, the sample cases - 
both respondents and nonrespondents - are placed into cells prior to computing (a) 
some type of ratio between current and prior periods for an item or (b) some type of rela- 
tionship between the survey items and the basic items: payroll, employment, and receipts. 
A research project to investigate alternate choices of cell definition for the Monthly Retail 
Trade Survey was recently completed by Huang (1986). She found that for some SIC’s an 
alternate procedure of defining cells reduces the mean square error (MSE) of estimated sales 
substantially. In addition, she compared the current method of imputing - using ratios 
of identicals - to three alternate methods with respect to bias and MSE. The current method 
was evaluated as the second best procedure. However, she concluded that the slight gains 
of the optimum procedure may not be worth the additional requirements associated with 
using it. 


5. IMPUTATION FOR EARNINGS IN THE CURRENT POPULATION SURVEY 


5.1 The Hierarchical Hot Deck 


The Current Population Survey (CPS) is an ongoing monthly Census Bureau survey of 
about 60,000 U.S. households. The CPS, sponsored by the Bureau of Labor 
Statistics, primarily collects labor force and employment information. Each March, the 
CPS administers an income supplement as part of the survey questionnaire. About 11-12% 
of the sample members do not respond to the income questions. Therefore, a special 
procedure, referred to as the ‘‘hierarchical hot deck,’’ has been developed to impute for 
missing responses. 

With the hierarchical hot deck, missing earnings values are inserted from the response 
record of another sample unit - a donor. The goal in selecting a donor is to find one with 
survey characteristics similar to those of the item nonrespondent. The first step in the pro- 
cess of finding suitable donors is to partition the entire sample, excluding total noninterview 
cases, into cells based on multi-way classifications of a number of survey characteristics. 
Within each cell a list is made of the respondents and nonrespondents for a given item. Donors 
from the list of respondents are assigned to the nonrespondents systematically, with a ran- 
dom start. If there are more nonrespondents than there are respondents in a cell for a given 
item, the responses of some, or perhaps all, of the respondents in the cell will be used more 
than once. In some cells, there may be one or more nonrespondents but no respondents for 
an item. 

To avoid the problem of having nonrespondents with no donors available, the process 
of defining cells and selecting donors for the item nonrespondents is carried out several times. 
At each stage, fewer cells are defined than were defined for the previous stage. For the final 
stage the number of cells defined is small enough so that it is certain that there will be donors 
available in each cell. The cells defined at successive stages are formed by collapsing the cells 
used at the previous stage. Each item nonrespondent will have one or more donors assigned. 
The donor used to obtain an imputed value will be the one identified at the earliest stage. 

The major advantage of this hierarchical procedure is that a very large number of cells can 
be defined at the first stage, due to the backup stages used. Whenever a donor is found at the 
first stage, the item nonrespondent and donor will be matched on a large number of survey 
characteristics. In such cases there should be a good chance that an adequate imputation 
is made. In other cases the item nonrespondents and donors will be matched on fewer 
characteristics. This hierarchical procedure tries to pick donors in a way that maximizes the 
number of matched relevant survey characteristics. 
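
The staged search for donors can be sketched in Python as follows. This is an illustration under stated assumptions, not the CPS production code: the function name, the record layout, and the example classification variables are hypothetical, and the systematic assignment is taken to mean cycling through the respondents of a cell from a random starting position. Skipping nonrespondents who already have a donor is equivalent to running every stage and keeping the donor found at the earliest one.

```python
import random
from collections import defaultdict


def hierarchical_hot_deck(records, item, stages, seed=0):
    """Return {nonrespondent index: (donor index, stage number)}.

    records: list of dicts; records[i][item] is None when the item is missing.
    stages:  list of tuples of classification variables, finest first; the
             last stage may be the empty tuple (a single cell).
    """
    rng = random.Random(seed)
    assigned = {}
    for stage_no, variables in enumerate(stages):
        cells = defaultdict(lambda: ([], []))    # key -> (respondents, nonrespondents)
        for i, rec in enumerate(records):
            key = tuple(rec[v] for v in variables)
            cells[key][0 if rec[item] is not None else 1].append(i)
        for donors, recipients in cells.values():
            pending = [i for i in recipients if i not in assigned]
            if not donors or not pending:
                continue                         # left for a later, coarser stage
            start = rng.randrange(len(donors))   # random start
            for k, i in enumerate(pending):      # donors are reused if too few
                assigned[i] = (donors[(start + k) % len(donors)], stage_no)
    return assigned
```

The imputed value for nonrespondent i is then records[donor][item], and the stage number records how closely the donor matched.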

For a more detailed description of this procedure, see Welniak and Coder (1980), 
Oh and Scheuren (1980a), or David, Little, Samuhel, and Triest (1986, Section 2). 


5.2 Evaluation of the CPS Hierarchical Hot Deck 


There have been some evaluation studies of the CPS Hot Deck: Welniak and Coder (1980); 
Oh and Scheuren (1980a and 1980b); Lillard, Smith, and Welch (1982); and David et al. (1986). 
One of the weaknesses noted of the CPS hot deck is that donor values may be used repeatedly, 
resulting in variance increases. The procedure could be modified to avoid using donor values 
more than once or twice; however, this change has not been made. The CPS hot deck 
procedure is based on the assumption that the distribution of responses for a survey variable 
is the same for respondents and nonrespondents in the same cell - the ignorability assumption. 

David et al. (1986) developed several model-based alternatives to the CPS hot deck and 
evaluated them and the CPS hot deck with respect to mean absolute and mean relative error. 
These evaluations were based on a CPS-IRS matched file. In creating this file, an attempt was 
made to match the March 1981 CPS file to the IRS tax records for 1980. Despite the hot 
deck’s apparent limitations, the CPS hot deck had a lower mean absolute and mean relative 
error than did the model-based alternatives. However, the models were developed for only 
10% of the full CPS sample used to develop the hot deck procedure. 


6. SUMMARY AND AREAS OF FUTURE STUDY 


In this paper an attempt has been made, primarily through examples, to describe the current 
approaches being taken to nonresponse adjustments in the U.S. Census Bureau’s censuses and 
surveys. Emphasis has been placed on the need for additional empirical and theoretical studies 
in both the demographic and economic areas in order to provide more objective guidelines (a) 
to design nonresponse compensation procedures and (b) to measure the effects of nonresponse 
on survey results for a variety of survey conditions. 

Some of the research called for in this paper is already underway but more will be needed. 
For example, to what extent can available ancillary data be used in conjunction with modeling 
and data analysis procedures to identify the key functional relationships needed to provide 
a ‘‘reasonably’’ accurate description of the response/nonresponse structure applicable 
to a given survey? 

In general, adjusting for nonresponse is just one of several steps taken to reduce the variance 
and bias of survey results. The degree to which these other steps aid in reducing the impact of 
nonresponse is an area for further research. Moreover, there should be continued efforts in 
support of research on recurring issues such as the impact of unit nonresponse weights and item 
nonresponse imputation on complex variance estimators, model approaches to determining 
appropriate adjustment factors, and the effectiveness of combining various types of 
nonresponse adjustment techniques.


Hot Deck Imputation Procedure Applied
to a Double Sampling Design 


SUSAN HINKINS and FRITZ SCHEUREN


ABSTRACT 


From an annual sample of U.S. corporate tax returns, the U.S. Internal Revenue Service provides estimates 
of population and subpopulation totals for several hundred financial items. The basic sample design 
is highly stratified and fairly complex. Starting with the 1981 and 1982 samples, the design was altered 
to include a double sampling procedure. This was motivated by the need for better allocation of resources, 
in an environment of shrinking budgets. Items not observed in the subsample are predicted, using a 
modified hot deck imputation procedure. The present paper describes the design, estimation, and evalua- 
tion of the effects of the new procedure. 


KEY WORDS: Double sampling; Hot deck; Imputation. 


1. INTRODUCTION 


When the U.S. Internal Revenue Service (IRS) is mentioned, the first words to cross
one’s mind may not be “sample surveys.” But every April, those of you from the U.S.
take part in at least one of our administrative “surveys” and file an individual income
tax return. We sample this administrative data annually for statistical purposes. Another 
of our major programs is an annual sample of U.S. corporate tax returns; that is the sample 
survey discussed here. 

The primary interest at a Symposium like this is in non-response or other undesirable missing 
data. Despite our extensive enforcement efforts, we at IRS also have such non-response 
problems. However, the present paper is concerned with a different type of missing data 
problem: missingness that is not unexpected, but is designed (see also, Strudler, Oh, and 
Scheuren 1986, for another example). We take the liberty of discussing these problems because 
we use techniques usually associated with non-response, e.g., hot deck imputation (Ford 1983). 
Our case allows an evaluation of the imputation procedure, since the underlying non-response 
mechanism is known. 

Double sampling has been introduced in our corporate tax return sample in an effort to 
reduce costs with only a “tolerable” loss of information. Reweighting to account for the sub- 
sampling stage is a standard estimation approach in double sampling (e.g., Cochran 1977); 
however, in our application, we would have had to reweight almost on an item-by-item basis. 
This was judged unacceptable by our users, who require rectangular data sets. (For an analogous 
approach in a Canadian context, see Colledge et al. 1978.) 

The imputation technique used - hot deck imputation - is procedurally simple. The need 
to discuss the application of such a relatively simple procedure may surprise theoreticians; 
but, as we will show, the problems of implementation within the setting of a large statistical 
operation are many.

In the remainder of the present paper, we describe in some detail the double sampling
procedure and the imputation technique employed. Preliminary results on the impact of these 
procedures are also presented and the last section contains our conclusions and future plans. 
A brief theoretical discussion of the estimators we are using and their properties is given 
in an Appendix. 


2. DESCRIPTION OF THE SAMPLING PROCEDURES 


An annual sample of U.S. corporate tax returns is used by IRS to estimate national totals
of both tax and economic variables. For example, approximately three million corporate 
tax returns will be filed for 1985, and the IRS sample will contain over 90,000 of these returns. 
(In Canada, there are two separate corporate tax return samples, each designed to meet nar- 
rower purposes. The Revenue Canada Taxation sample (e.g., Burpee and McGrath 1982) 
was developed for tax policy simulation purposes. The Statistics Canada sample (e.g., Am- 
brose 1985) is intended primarily to estimate economic aggregates. It is our belief that separate 
designs in the U.S., but not entirely separate processing systems, could lead to improvements 
in efficiency over the current procedures; however, the work done (Clickner et al. 1984) in- 
dicates that the problem is quite difficult and progress has been slow.) 

The annual estimates obtained are for the entire corporate population and for subpopula- 
tions, usually defined by industrial activity and size. The underlying population is highly 
skewed. For most variables, a small proportion of the population accounts for a substantial 
fraction of the total dollar amount. Examples for 1982 corporations are given in Exhibit 1. 

A highly stratified sample design is used; small corporations are selected with small pro- 
bability and large corporations are selected with certainty (Jones and McMahon 1984). The 
strata are defined by industrial classification and the size of the corporation (i.e., in terms 
of assets and net income). Selection probabilities for each stratum are determined by employing 
a modified form of Neyman allocation. Almost all of the returns in the 100% strata (returns 
selected with certainty) have total assets of $50 million or more. A form of post-stratified 
raking ratio estimation is used to weight the sample results (Leszcz, Oh, and Scheuren 1983). 

Retrieving the information from each sampled return is a time-consuming and expensive
process. Over 600 items may be retrieved from a return, and these items are not simply extracted;
they are also carefully checked and redistributed to compensate for taxpayer reporting variations.
The complete process is referred to as “editing the return”. The cost of “editing”
varies by degree of complexity. It may take only twenty-five minutes to edit a fairly simple
return but as long as a week to edit a really complicated one. The quality of the editing is
vital to our estimates, as these checks reduce, but do not eliminate, reporting inconsistencies.


Exhibit 1
Degree of Concentration of Selected Corporate Variables

                          Assets Under      Assets $50 Million
Selected Items            $50 Million       or more

Number of Returns            99.6%              0.4%
Total Assets                 16.3              83.7
Total Receipts               39.3              60.7
Total Income Tax             25.9              74.1


Source: Internal Revenue Service, 1985.

Indeed, nonsampling error is a serious concern in the data “editing” process, particularly
for the largest corporations. In order to spend proportionately more resources on reducing the 
nonsampling error for the large returns, we introduced stratified double sampling for the 
smaller returns; specifically, certain data items were retrieved on only a subsample of the 
returns (i.e., a subset of returns with assets under $50 million). Although this change would 
increase the error for some variables on the small returns, we expected that the procedure 
would have little adverse effect on the estimates of national totals, or on the subdomain 
estimates of primary interest to our major users. There were two main reasons for this con- 
jecture: 


- As already noted, corporate returns with total assets of $50 million or more were not 
subject to the extra sampling step. 

- The information loss due to the subsampling was reduced by the choice of the items 
or variables to be subject to subsampling. 


By and large, as will be shown, the results obtained so far confirm our expectations. 


Items Selected for Subsampling 


When certain miscellaneous items on a return are nonzero, the taxpayer must attach a 
schedule providing additional information. For example, if the item ‘‘Other Income’’ is 
nonzero, the corporation must describe what was included under this category. The schedules 
are attached on separate sheets of paper and have no standard form or length. The process 
of editing a schedule has several parts: finding the schedule, deciding whether the taxpayer 
included appropriate amounts in ‘‘Other Income’’, and making changes if there are errors. 

Beginning with the tax year 1981 corporate program, the statistical editing of data from 
the tax return was done in stages, and certain items were initially transcribed for statistical 
use directly from the return. Employing automatic tests, items or schedules could then be 
“flagged” for abstraction or further scrutiny in later stages (Cys et al. 1982). This new strategy
allowed us to: 


- Retain original taxpayer information as reported so that the amount of editing change 
could be evaluated. Prior to the 1981 sample, we had no information regarding the 
extent of the adjustments being made by editing. The editors only recorded the final 
result. (See Powell and Stubbs 1981.) 

- Decide whether or not to review a particular schedule based on the initial information 
transcribed. (Again, prior to the 1981 program, editors were, of course, required to 
completely edit all schedules.) 


For the 1981 and 1982 corporate programs, seven items and their associated schedules 
were picked for subsampling: schedules for Other Income, Other Deductions, Other Costs 
of Goods Sold, Other Current Assets, Other (Noncurrent) Assets, Other Current Liabilities 
and Other (Noncurrent) Liabilities. 

The reported amounts on a corporate return may be modified substantially as a result 
of the editing. For example, consider the ‘‘Other Income’’ schedule shown in Exhibit 2. The 
original amounts (in column 1) are observed initially for every return. The variables being 
subsampled are changes that would be made if the Other Income schedule were edited
(column 2). In this hypothetical case, we have an original Other Income amount of $1,600, which,
when examined by the editor, could be reclassified as including $900 from Business Receipts, 
$300 in Rents and $400 that really belongs in Other Income. The variables of interest are, 
of course, the final (‘‘corrected’’) amounts for each item. 

Before implementing the new processing system, an experiment was run comparing the
amount of time it took to do the reduced, initial transcription and the amount of time it
took to do the complete editing (reading all schedules). As expected, the reduced edit was
significantly faster (and, therefore, cheaper). Considerable resources could be saved by subsampling.
(Conservatively, we extrapolated 1981 cost savings of at least $300,000, assuming
only limited use of the subsampling technique.)


Exhibit 2
Illustration of Editing Other Income

Item              Original        Change          Final
                  Amounts ($)     Amount ($)      Amounts ($)

Other Income        1,600          -1,200            400
Receipts              500            +900          1,400
Rents                   0            +300            300
Interest              700               0            700


Double Sampling 


We are now ready to describe the basic two-dimensional stratification chosen for our double 
sampling. The returns are stratified into ‘‘crucial’’ returns (Group A) versus the remaining 
returns (Group B). ‘‘Crucial’’ returns include all returns with total assets of $50 million or 
more, thereby including the important ‘‘large’’ returns and most returns selected into the 
sample with certainty. In addition, crucial returns should include corporations of any size 
for which the likelihood of an editing change was high. What we want, obviously, is a sub- 
sampling plan that has us edit all schedules that have a high probability of a change (especially 
a large change) and lets us subsample the rest. 

In an attempt to predict which schedules are likely to change, a record is included in Group 
A if the original amount in Other Income, to continue our illustration, is unusually large 
compared to the amount in Total Income. 

Also, since we do not want to impute large amounts, cases where Other Income is above 
a certain dollar value should be included in Group A, as well. (Unfortunately, this was done 
only indirectly.) By inference, Group B is supposed to include only small returns which we 
believe are likely to have little or no change made as a result of editing. (See Barker et al. 
1982, for details.) 

For the crucial returns in Group A, all variables (items) are always completely observed. 
Only returns in Group B are subject to the subsampling of the seven schedules mentioned 
earlier. Even for Group B returns, the original amounts for all items are always recorded; 
therefore, some information is obtained for every item. The information not obtained for 
some records in Group B is the change due to editing a schedule. It is these changes that 
are being imputed using the procedure described in the next section. Not all variables are 
affected by the subsampling. For example, of the 600 items picked up for the 1981 corpora- 
tion program, only 56 were in any way affected by the double sampling; however, of the 
approximately 100 major income and balance sheet items, nearly one half could be affected. 


3. THE IMPUTATION PROCEDURE 


The missing information (i.e., changes from editing) in Group B was imputed using a 
hot deck procedure within adjustment cells. A record with schedules to be imputed was mat- 
ched to a donor record, in the same adjustment cell, with these same schedules edited. (The 
formation of adjustment cells is described later in this section.)

In 1981, the subsampling rate was 10% for the returns subjected to subsampling: one out
of ten was selected systematically for editing (these were the hot deck “donors”) and the
other nine were left to be imputed. In 1982, the subsampling rate was kept at 10% for nonfinancial
returns (trade, manufacturing, etc.) but was raised to 20% for financial returns
(banks, insurance companies, etc.).

Within an adjustment cell, the number of returns, \(n'\), can be divided into the number
of donors, \(n''\), and the number of imputes, \(n' - n''\). Because of the small subsampling rate,
the number of donors is almost always smaller than the number of imputes. In particular,
let \(n' - n'' = r n'' + t\), where \(r\) and \(t\) are nonnegative integers and \(0 \le t < n''\). Then the
hot deck procedure selects all \(n''\) donors \(r\) times, and selects the remaining \(t\) units by simple
random sampling without replacement.
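
The donor selection rule just described can be sketched in a few lines. This is only an illustrative sketch, not the production IRS procedure: the function name `select_donors`, the use of Python's `random` module, and the final shuffle (which randomizes which impute receives which donor) are assumptions made for the example.

```python
import random


def select_donors(donor_ids, n_imputes, rng=None):
    """Return one donor id per record to be imputed within an adjustment cell.

    With n_imputes = r * n_donors + t and 0 <= t < n_donors, every donor is
    used r times and t further donors are drawn by simple random sampling
    without replacement.
    """
    donor_ids = list(donor_ids)
    if not donor_ids:
        raise ValueError("an adjustment cell needs at least one donor")
    rng = rng or random.Random()
    r, t = divmod(n_imputes, len(donor_ids))
    selected = donor_ids * r + rng.sample(donor_ids, t)
    # Assumption: donors are matched to imputes in random order.
    rng.shuffle(selected)
    return selected


# Hypothetical cell with 3 donors and 11 records to impute (r = 3, t = 2).
assignment = select_donors(["d1", "d2", "d3"], 11, random.Random(0))
print(len(assignment))
```

Under this rule no donor is used more than one time more often than any other donor in the same cell.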

To continue our illustration, recall that the item of interest is Z, the final ‘‘corrected’’ 
amount for Other Income; Z can be written as Z = X — Y, where X is the original tax- 
payer amount in Other Income and Y is the change made due to editing the Other Income 
schedule. It is only the change, Y, that is unobserved and must be estimated for a subset 
of the returns in Group B. 

If we simply employ a conventional hot deck procedure and estimate the unobserved \(y_i\)
value, on record \(i\), with the observed value \(y_j\) from donor record \(j\), then the resulting estimate
of the final value \(z_i\) may not satisfy the edit checks. For example, assume the donor record
had $30,000 originally as Other Income, and $15,000 was removed when the schedule
was edited. Suppose that on the record to be imputed, the original amount in Other
Income is $10,000; then the imputed change of $15,000 would result in a negative estimate
for Other Income:


\[ \hat{z}_i = x_i - \hat{y}_i = 10{,}000 - 15{,}000 = -5{,}000. \]


Since the amount for Other Income must be nonnegative, edit checks would fail and ad- 
ditional adjustments would have to be made to the record. (See Sande 1982, for a general 
discussion of this problem.) Since the original amount is always observed, it seemed more 
reasonable to “hot deck” the relative change R = Y/X rather than the actual change Y.
In this example, since the donor record had one half of the amount in Other Income remov- 
ed after reading the schedule, then 1/2 should be removed on the imputed record. The 
estimated final amount in Other Income is then 


\[ \hat{z}_i = x_i - \hat{y}_i = 10{,}000 - (1/2)(10{,}000) = +5{,}000. \]
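
The contrast between the two imputation rules in this example can be checked with a short sketch. The function names `impute_amount` and `impute_ratio` are hypothetical and chosen only for illustration; the dollar figures are those of the example above.

```python
def impute_amount(x_recipient, y_donor):
    """Conventional hot deck: copy the donor's change Y directly."""
    return x_recipient - y_donor


def impute_ratio(x_recipient, x_donor, y_donor):
    """Ratio hot deck: copy the donor's relative change R = Y / X."""
    if x_donor == 0:
        raise ValueError("donor original amount must be nonzero")
    r = y_donor / x_donor
    return x_recipient - r * x_recipient


# Donor: X = 30,000 with a change of 15,000; recipient: X = 10,000.
print(impute_amount(10_000, 15_000))         # -5000
print(impute_ratio(10_000, 30_000, 15_000))  # 5000.0
```

Because the imputed final amount under the ratio rule is x(1 - R), it is nonnegative whenever the recipient's original amount is nonnegative and the donor's relative change does not exceed 1, that is, whenever the donor's own final amount was nonnegative.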


In addition to satisfying the edit checks, we expected the ratio procedure to reduce the 
variance of our estimates relative to the basic hot deck approach; however, the variance of 
the estimator is not analytically tractable and must be measured empirically. We have not 
yet verified in our corporation application the smaller variance that we conjecture; but simula- 
tion results do support the approach we have taken. However, by introducing the ratio, our 
estimators are now biased. We conjectured that the biases would be small and in fact they 
were, for the most part, as we shall show. 

The model associated with our imputation procedure is based on the definition of the 
double sampling strata being used and on the definition of the adjustment cells. Several con- 
structive steps were taken to make the approach reasonable. In the initial stratification, an 
attempt was made to subsample only those records that were likely to have no changes or 
only small changes. Also, the adjustment cells were subjectively chosen to be homogeneous 
with respect to the magnitude of the relative editing change that might be made. In particular,
the adjustment cells are defined in terms of industrial classification, corporation size and
the pattern of items present on the return. There were thirty categories defined by various
industrial and size criteria (see Figure 1). In addition, sixteen item patterns were treated
separately, defined by the presence/absence of Other Income (2 classes), the presence/absence
of either Other Deductions or Other Costs of Goods Sold (2 classes), Other Current Assets
or Other Assets (2 classes) and, finally, Other Current Liabilities or Other Liabilities (2 classes).
The maximum number of adjustment cells was 30 x 16 = 480.


Figure 1. Hierarchy of Ratio Hot Deck Adjustment Cells
(Tree diagram not reproduced. Branch codes: A = Retail, B = Wholesale, C = Transportation
and Utilities, D = Other, E = Very Small, F = Small, G = Medium.)

For each item pattern, a hierarchical structure was developed so that collapsing could be 
done when there were an insufficient number of donors for use in the imputation (see Figure 
1). The first division is into financial returns (banks, insurance companies, etc.) versus non- 
financial records; cells are not collapsed across this division. The next levels of the hierarchy 
separate cases according to fairly broad industrial classes and according to the size of the 
corporation, in terms of assets and net income. Recall that the largest corporations are not 
subject to subsampling and, so, should not need imputation; hence, broad industrial and 
size groups seemed sufficient. 

The quality of our estimation depends on how much collapsing takes place. In 1981, we 
had 36,586 returns with at least one schedule to impute, and 3,989 donors. For the non- 
financial returns we never collapsed across the major industrial classification, and, in fact, 
we always had some size distinction. Many cells were not combined at all, but maintained 
the maximum detail possible. In contrast, for financial returns the size variable was often 
lost by combining all cells, and major industries were sometimes combined (Hinkins 1983). 
For one pattern, all financial returns were combined into the same cell. 

Based on our 1981 experience, several changes were made in the 1982 double sampling design:


- Due to the extensive collapsing of cells for financial returns in 1981, the subsampling
rate for small financial returns was doubled to improve the estimates (from 10% to
20%, as noted earlier).

- In 1981, the double sampling procedure was not applied across the entire sample, but
was restricted to certain processing centers. Other processing centers collected all information,
as before. In 1982, the procedure was applied across the whole sample. The
relative number of records in 1982 with some items imputed was 63 percent, compared
to 40 percent in 1981.

- In order to estimate the hot deck imputation variance (Oh and Scheuren 1980; Rubin
and Schenker 1986), an additional restriction was imposed on the 1982 design, in that
we required that there be at least two donors in each adjustment cell. (See Table 1.)


Table 1
Selected Statistics on Hot Deck Ratio Imputation, 1981-1982

                              Tax Year 1981               Tax Year 1982
Item                      Financial  Nonfinancial     Financial  Nonfinancial

NUMBER
  Donors                       908       3,081           1,806       4,697
  Imputes                    7,912      28,674          10,719      43,477
  Adjustment Cells             113         238             142         260
DONOR CELL SIZE
  Average                        8          13              13          18
  Maximum                       68          58             126          98
  Minimum                        1           1               2           2
DONOR-TO-IMPUTE RATIOS
  Average                      .11         .11     [illegible]         .11
  Maximum                     1.00 [illegible]            2.00         .28
  Minimum                      .05         .05             .05         .05

Note: For 1982, cell sizes of 2 donors each were required in order to make possible the calculation of the variance.


In 1982, there were 54,196 records to be imputed from 6,503 donors, and there was con- 
siderably less collapsing of adjustment cells (Hinkins 1984). In particular, for financial records, 
94 percent of the records imputed in 1982 were in adjustment cells defined with some size 
distinction, compared to 75 percent in 1981. Table 1 provides a selection of other statistics 
on the operation of the 1981 and 1982 systems. 


4. INITIAL EVALUATION OF BIAS 


The evaluation of the 1982 double sampling system is still underway, but some initial results 
are available on the potential biasing effects of the imputation. Bias should be small if R,
the ratio of the editing change to the original amount, is always small, or if R is constant 
within adjustment cells. We have taken the approach of looking for the ‘‘worst’’ cases of 
bias by looking for examples where R is neither small nor constant. We confine attention 
to only two variables: Other Income and Business Receipts. 


Figure 2. Changes in Other Income: Group B Donors only
(Scatterplots not reproduced. Panels: Nonfinancial Records and Financial Records; horizontal
axis: Original Amount X; both axes extend to $400,000.)


Unbiased Model 


The ratio bias in the hot deck imputation we are using would be zero if the relationship 
Y = RX were to hold for all members of each adjustment cell chosen. An overall plot of 
the data might be useful, to look at the degree to which this model holds for Other Income. 
In Figure 2, therefore, we have plotted the Group B donors separately for financial and non- 
financial corporations. There is a distinct difference between these two categories. Nonfinan- 
cial returns are much less likely to change; in 1982, 14 percent of the nonfinancial donors 
had a change made to Other Income, compared to 59 percent of the financial records. Also, for 
financial returns at least, it looks as if the model E(Y) = RX might be appropriate. Further
work along these lines is intended, but the scatterplot encourages us to believe that, by and 
large, existing biases would be small. 


Actual Bias Measures 


Table 2 provides relative bias measures for selected worst case industries. These are shown
for all returns in that industry and for returns with assets under $25 million (i.e., for corporations
likely to be most affected by the new procedures). Of the items changed in the double
sampling, the Other Income schedule showed some of the largest values of R and the most
dispersed distributions of R. The greatest change as a result of editing Other Income was made
in the Business Receipts amount. It should be noted that the bias estimates in Table 2 are
subject to considerable sampling error (Czajka 1986). Except for the very smallest amounts, 
however, it is conjectured that the estimates shown probably have the correct sign and are 
of the appropriate order of magnitude. 

These examples indicate that within small subpopulations, there can be noticeable bias 
effects. However, even within a major industry, selected for its potential problems, the bias 
across all sizes is relatively small.


Table 2
Estimated Relative Biases for Business Receipts and Other Income
by Selected Minor Industries, 1982
(Biases as percent of applicable total)

                                          Business Receipts           Other Income
Selected Minor Industries               All        Assets Under     All        Assets Under
                                        Returns    $25 Million      Returns    $25 Million

WHOLESALE TRADE
  Machinery, Equipment and Supplies     -1.40         -2.6            0.4          0.6
  Miscellaneous Trade                   -0.30         -0.5           -1.3         -2.4
RETAIL TRADE
  Auto Dealers and Service Stations     -0.30         -0.5            3.3          4.6
FINANCE AND INSURANCE
  Banking                               -0.02         -0.7            0.1          2.4
  Credit Agencies Except Banks          -0.50         -2.2           -0.9         -9.0
  Insurance Agents                      -0.60     [illegible]    [illegible]   [illegible]

Note: All calculations are based on design-weighted estimates of the biases involved. The industries were selected
to represent worst case examples.


Czajka’s results (1986) indicate that for global estimates (across all industries), the bias
effect of the imputation is small (less than 1% in all cases; considerably less than .05% in
most cases).

There is no question that some of the biases in Table 2 appear large and warrant concern;
however, it is important to realize that the overall effect of the bias on the root mean square
error is small for all returns, generally 5% or less. These results give us strong evidence
that the procedures employed did little or no harm to the data needed by our users; that,
however, is not to say that major improvements, like those envisioned for 1985 and 1986,
should not be made.
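
To see why a visible bias can still have a modest effect on the root mean square error, recall that MSE = variance + bias squared, so the RMSE exceeds the standard error by the factor sqrt(1 + (B/SE)^2). The sketch below tabulates this factor for a few bias-to-standard-error ratios. Reading the "5% or less" figure as this relative increase is an assumption made here for illustration, and the ratios shown are hypothetical, not values from the evaluation.

```python
import math


def rmse_inflation(bias_to_se):
    """Relative increase of RMSE over the standard error: sqrt(1 + (B/SE)^2) - 1."""
    return math.sqrt(1.0 + bias_to_se ** 2) - 1.0


# Hypothetical bias-to-standard-error ratios.
for ratio in (0.10, 0.20, 0.32, 0.50):
    print(f"bias/SE = {ratio:.2f} -> RMSE increase = {rmse_inflation(ratio):.1%}")
```

On this reading, a bias of roughly one third of the standard error corresponds to an RMSE increase of about 5 percent.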


5. FUTURE PLANS AND SUMMARY 


Double sampling and imputation were not used for the 1983 and 1984 samples because 
of processing constraints. They will be used again starting with the 1985 sample. As part 
of reinstituting the imputing process, we are planning to make several changes: 


- It will no longer be necessary to initially transcribe certain items for statistical purposes 
before subjecting the records to double sampling. The fields needed are now being ob- 
tained directly from the IRS revenue processing system, so they are available before we 
begin reading and editing the tax return; thus, before editors first look at a return, we 
can designate whether or not they should review certain schedules. This makes the use 
of stratified double sampling even more appealing; the savings should increase. 

- However, because of the new processing system, only three schedules are now available 
for subsampling. The schedules for 1985 are Other Income, Other Deductions and Other 
Costs of Goods Sold; the remaining four schedules used in 1981 and 1982 had to be 
dropped from the subsampling design.

- Despite the modest success of the 1981 and 1982 procedures, changes will be made for
1985 in the imputation methods. For example, the current definition of the adjustment 
cells could be improved, and separate imputation depending on the pattern of items 
represented needs to be reconsidered. The possible use of predictive mean matching 
within adjustment cells also bears examination (Little 1986). For 1986, refinements in 
the subsampling plan will need to be looked at too. 

- Finally, we would like to base our estimates, in some way, on previous years’ data,
so as to be able to impute missing information earlier in the processing. In order 
to minimize the collapsing of adjustment cells, the 1981 and 1982 imputation process- 
ing had to wait for all records to be available. This delayed production by several weeks. 
We could avoid this problem by further increasing the number of donors; but, the editing 
of more records has the obvious disadvantage of increasing costs. On the other hand, 
by basing our approach in part on the previous year’s data, we might not only improve 
the estimation, but also allow the imputation calculations to be done in the mainstream 
of processing. 


Overall Summary 


In this paper, we have described the reasons we had for making major changes in our 
statistical processing of corporate returns: 


- The traditional complete data estimate was rejected in favor of double sampling because
of cost considerations.

- The usual double sampling estimator (reweighting the complete data) was rejected
because it did not result in a rectangular data set.

- A conventional hot deck approach was rejected because the resulting estimates could
fail the edit checks.


Instead, the relative change was estimated using ratio hot deck imputation within adjust- 
ment cells. 

We conjectured that because the double sampling procedure was restricted to a subset 
of the ‘‘small’’ corporations, the estimates of interest to our major users should be virtually 
unaffected; indeed, these estimates could even be improved, by better allocating our resources 
to validate and correct the records of the larger corporations. Our results so far largely vin- 
dicate these conjectures. 

Compared to the traditional complete data estimator, the use of double sampling 
and hot deck imputation increased the mean square error of estimates in two ways: bias
was introduced, and the variance of the estimator was increased. Our preliminary results 
indicate that there could be a significant bias effect for some estimates; however, the 
examples were chosen because they appeared to be cases where the hot deck ratio method 
would be weakest. Even so, the estimated overall effect of the procedure on the root 
mean square error appears relatively small. Looking at the increase in variance, the largest 
component is usually due to the decrease in sample size (double sampling). This increase 
in variance also turned out to be relatively small, since only one component of the final amount 
(the change) is imputed; the variance of the original values appears to dominate the variance 
of the changes. 

In conclusion, while there are improvements to make, we feel encouraged to continue 
with our current double sample design and imputation technique. Perhaps at another
conference of this type we will be able to report on the further results of our research.


APPENDIX: SOME BASIC THEORY


This appendix provides some technical details on the double sampling procedure as applied 
in our particular situation. We contrast several potential estimators for the double sampling 
design we chose. An overall summary of the bias and variance expressions for these different 
approaches is found in Table A. 

For this discussion, we ignore the underlying stratified sample design and act as if a simple 
random sample had been taken, or equivalently we consider estimates within a sampling 
stratum. To do otherwise would make the notation exceedingly complex, but would not change 
the main points we wish to make. 

Let us again consider just one of the items subject to subsampling, namely Other Income
as before. The variable of interest is \(Z\), the final, corrected value of Other Income, and \(Z\)
can be decomposed as

\[ Z = X - Y, \]

where \(X\) = the original taxpayer (or revenue processing) value of Other Income,
      \(Y\) = the change made to Other Income after reviewing the schedule.

The population values and parameters are indicated by upper-case letters and the sample
statistics by lower case. The population parameters of interest are the finite population mean
and variance, i.e.,

\[ \bar{Z} = \sum_{i=1}^{N} Z_i / N, \qquad S^2(Z) = \sum_{i=1}^{N} (Z_i - \bar{Z})^2 / (N - 1). \]


Complete Sample - Prior to the introduction of double sampling, the estimates were calculated 
from a complete sample of size $n'$, and the unbiased estimator of $\bar{Z}$ was 

$$\bar{z} = \sum_{i=1}^{n'} z_i / n'.$$

Ignoring the finite population correction ($N$ is large), the variance is 

$$\mathrm{Var}(\bar{z}) = S^2(Z)/n'.$$
Table A 

Selected Properties of Alternative Estimators 

Estimator          Bias     Variance                                          Edit checks satisfied
Complete Sample    0        $\mathrm{Var}(\bar{z})$                           Yes
Double Sample      0        $\mathrm{Var}(\bar{z}) + c_1 S_B^2(Y)$            Yes
Hot Deck 
  Amount (Y)       0 (a)    $\mathrm{Var}(\bar{z}) + c_1 (1 + c_2) S_B^2(Y)$  No
  Ratio (R)        $b_1$    $\mathrm{Var}(\bar{z}) + V_1$                     Yes
Combined Ratio     $b_2$    $\mathrm{Var}(\bar{z}) + V_2$                     Yes

(a) In general, the basic hot deck procedure is unbiased only when it results in final values that satisfy the edit checks. 


In Table A, we use the properties of $\bar{z}$ as a benchmark to compare alternative 
estimators. 


Double Sampling Estimation - Using Cochran's notation (Cochran 1977, 12.2), the original 
sample of size $n'$ has now been stratified into the two groups A and B, with $n_A'$ and $n_B'$ 
units respectively. A subsample of size $n_B$ is selected from group B. The original taxpayer 
amount $X$ is recorded for all $n' = n_A' + n_B'$ records. The changes due to editing Other 
Income, $Y$, will be recorded for all $n_A'$ units in group A and for the random subsample 
of $n_B$ units in group B. 

Since the double sampling procedure only applies to variable $Y$, within group B, the double 
sampling estimator of $\bar{Z}$ is 

$$\bar{z}_d = \bar{x} - \bar{y}_d = \bar{x} - \Big( \sum_{i=1}^{n_A'} y_{Ai} + (n_B'/n_B) \sum_{j=1}^{n_B} y_{Bj} \Big) / n'$$

and $\bar{z}_d$ is unbiased. 

Let $N_B$ = number of population units falling in stratum B, 
$P_B = N_B/N$, proportion of population falling in stratum B, 
$\bar{Y}_B$ = population mean in stratum B, 
$S_B^2(Y) = \sum_{i=1}^{N_B} (Y_{Bi} - \bar{Y}_B)^2 / (N_B - 1)$, population variance of $Y$ in stratum B, 
$1/K$ = the subsampling proportion = $n_B/n_B'$. 


If the sampling proportion, $1/K$, is assumed fixed (in our application, $1/K$ = .10 or .20), 
it follows (Cochran 1977) that the unconditional variance of $\bar{z}_d$ is, ignoring the fpc, 

$$\mathrm{Var}(\bar{z}_d) = \mathrm{Var}(\bar{z}) + c_1 S_B^2(Y) = [S^2(Z) + P_B (K - 1) S_B^2(Y)]/n',$$


where $c_1 = P_B (K - 1)/n'$. 

Therefore the price paid for the reduction in cost due to not editing every schedule is 
the increase in variance due to double sampling. This increase in variance looks potentially 
damaging because $K$ is large. However, recall that $Z = X - Y$, and the increase in variance 
is a function only of the variance of $Y$ within subpopulation B. We expect $S^2(X)$ to 
dominate $S^2(Y)$, which should further dominate $S_B^2(Y)$, i.e. 

$$S^2(X) \gg S^2(Y) \gg S_B^2(Y).$$

This is because the size of the variance is related to the mean value, and $Y$ should be small 
compared to $X$. (For most items, we expect the amount misclassified to be small compared 
to the original amount.) Therefore we expect $S_B^2(Y)$ to be so much smaller than $S^2(Z)$ that 
$P_B (K - 1) S_B^2(Y)$ will still be relatively small compared to $S^2(Z)$, and so the increase in 
variance due to subsampling will be relatively small. This is not guaranteed, but Czajka's 
results bear this out for most items (Czajka 1986). 
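
The double sampling estimator and its variance penalty can be sketched in a few lines of Python. This is only an illustration under the simple random sampling simplification used in this appendix; the function and argument names are hypothetical and are not taken from the production system.

```python
def double_sample_mean(x, y_a, y_b_sub, n_b_prime):
    """Double sampling estimate of the mean of Z = X - Y.

    x          : original values X for all n' first-phase records
    y_a        : changes Y for every unit in group A
    y_b_sub    : changes Y for the n_B subsampled units of group B
    n_b_prime  : number of first-phase units in group B (n_B')
    """
    n_prime = len(x)
    x_bar = sum(x) / n_prime
    k = n_b_prime / len(y_b_sub)          # K = n_B' / n_B, inverse subsampling rate
    y_bar_d = (sum(y_a) + k * sum(y_b_sub)) / n_prime
    return x_bar - y_bar_d


def variance_increase(p_b, k, s2_b_y, n_prime):
    """Added variance c_1 * S_B^2(Y), with c_1 = P_B (K - 1) / n' (fpc ignored)."""
    return p_b * (k - 1) * s2_b_y / n_prime
```

For instance, with $P_B = 0.5$, $K = 10$, $S_B^2(Y) = 4$ and $n' = 100$, the added variance is $0.5 \times 9 \times 4 / 100 = 0.18$, which is to be compared with $S^2(Z)/n'$.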


Hot Deck Imputation - Hot deck imputation was used, within adjustment cells, to 
reconstruct a rectangular data set. In particular, a return with schedules to be imputed 
was matched to a donor in group B, in the same adjustment cell, with these same 
schedules edited. 

Imputing the missing values of $y$ with a hot deck procedure, using simple random sampling, 
further increases the variance over using the double sampling estimate ($\bar{z}_d$). However 
the additional increase in variance due to using hot deck imputation is small compared to 
the increase due to double sampling. This relative increase in variance due to imputing, denoted 
as $c_2$ in Table A, is bounded and in our case is small. (When $K = 2$, $c_2 \le 0.125$. See, for 
example, Hansen, Hurwitz, and Madow 1953.) 

As discussed in the paper, there is a problem with using an ordinary hot deck approach. 
If we simply estimate the unobserved $y_i$ value, on record $i$, with the observed value $y_j$ 
from donor record $j$, then the resulting estimate of the final value $z_i$ may not satisfy 
the edit checks. Additional corrections would have to be made to the record. Since the original 
amount is always observed, it seemed more reasonable to "hot deck" the relative change 
$R = Y/X$ rather than the actual change $Y$. In addition to satisfying the edit checks, we 
expected the ratio procedure to reduce the variance of our estimates relative to the basic 
hot deck approach; however the variance of our estimator is not analytically tractable 
and must be measured empirically. Also, by introducing the ratio, our estimators are 
now biased. We conjectured that the biases would be small and in fact they were, for the 
most part, as seen in Table 2. In practice, the hot deck imputation was done within adjust- 
ment cells, created by post-stratifying the records into what we hope are homogeneous cells. 
The effect of this post-stratification should be to reduce variance and bias effects, but 
that is dependent on our skill in defining the imputation cells (an area with ample room 
for additional work). 
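
A minimal sketch of hot decking the relative change $R = Y/X$ within adjustment cells is given below. It assumes each record is a dictionary with hypothetical keys `cell`, `x` and `y` (with `y` set to `None` when the schedule was not edited), and it draws donors at random within the cell; the donor selection rules of the actual system are not reproduced here.

```python
import random


def hot_deck_ratio(records, rng):
    """Impute missing y as x * (donor y / donor x), donor drawn within the same cell."""
    donor_ratios = {}
    for rec in records:
        if rec["y"] is not None and rec["x"] != 0:
            donor_ratios.setdefault(rec["cell"], []).append(rec["y"] / rec["x"])

    completed = []
    for rec in records:
        out = dict(rec)                       # leave the input records untouched
        if out["y"] is None:
            ratios = donor_ratios.get(out["cell"])
            if ratios:                        # records in cells without donors stay missing
                out["y"] = out["x"] * rng.choice(ratios)
        completed.append(out)
    return completed
```

Because the imputed change is a proportion of the record's own original amount, a donor ratio between 0 and 1 cannot produce a change larger than the original value, which is the property that motivated the ratio approach.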


Ratio or Regression Estimation - We are also considering ratio (or regression) estimates 
within cells, instead of the hot deck estimates. For example, $\hat{y} = \hat{r}x$, where $\hat{r} = \bar{y}/\bar{x}$ 
is calculated within appropriate cells. Referring to Table A, the increase in variance, $V_2$, 
using the ratio estimator could be approximated using the formulas for the ratio estimator 
(e.g., Cochran 1977). However, these formulas are large sample approximations, and our 
sample sizes are almost always quite small. (In this case, the sample size is the number of 
donors, $n_B$, in an adjustment cell.) Therefore, empirical results are needed here. 

Similarly, the bias, $b_2$, can be found using the results for ratio estimators. Unlike the hot 
deck ratio, the bias of the ratio estimator goes to zero as the sample size increases and in 
this sense the ratio estimator is more robust. In fact, the hot deck ratio estimator is unbiased 
only if the model $Y = \beta X$ is correct. (Of course, the bias of both estimators goes to zero 
as the fraction of missing data goes to zero.) However, even if the model $Y = \beta X$ is 
incorrect, the ratio estimator is consistent. 

There are of course many other options; multivariate regression models could be 
investigated. We are still in the early stages of this project and we certainly have our work 
cut out for us now and in the upcoming years. 
Comparison of Weighting and Imputation Methods 
for Estimating Unsampled Data 


SYLVIE MICHAUD¹ 


ABSTRACT 


The Canadian Census of Construction (COC) uses a complex plan for sampling small businesses (those 
having a gross income of less than $750,000). Stratified samples are drawn from overlapping frames. 
Two subsamples are selected independently from one of the samples, and more detailed information 
is collected on the businesses in the subsamples. There are two possible methods of estimating totals 
for the variables collected in the subsamples. The first approach is to determine weights based on sampling 
rates. A number of different weights must be used. The second approach is to impute values to the 
businesses included in the sample but not in the subsamples. This approach creates a complete 
"rectangular" sample file, and a single weight may then be used to produce estimates for the population. 
This “large-scale imputation” technique is presently applied for the Census of Construction. The pur- 
pose of the study is to compare the figures obtained using various estimation techniques with the estimates 
produced by means of large-scale imputation. 


KEY WORDS: Weighting; Large-scale imputation; Unsampled. 


1. INTRODUCTION 


The Census of Construction (COC) is an annual survey which attempts to estimate ex- 
penses in the construction field. Although it is called a “census”, in fact only businesses 
having a gross income exceeding $750,000 are surveyed. Various financial and non-financial 
data are collected by means of a long questionnaire mailed to these firms. For businesses 
with a gross income between $10,000 and $750,000, expenses are estimated from a sample 
of administrative data. First, two samples are selected independently from overlapping 
sample frames. Two subsamples are then drawn from one of the samples in order to obtain 
additional information. 

Variables collected in the subsamples may be estimated in two different ways. The method 
currently used for the Census of Construction is to impute values for the businesses included 
in a sample, but not in a subsample. This creates a complete “rectangular” file, from which 
estimates for the overall population may be produced using only one weight. An alternative 
would be to calculate weights based on the probabilities of selection; these would have to 
be calculated separately for different subsets of data. The purpose of this study is to compare 
the estimates obtained by weighting with the estimates obtained by imputation. 

The study was carried out on a population of unincorporated businesses only because, 
for fiscal year 1983, the sample selection strategies for unincorporated and incorporated 
businesses were different. The strategy used for corporations will be modified for fiscal 1984 
to be equivalent to the strategy for unincorporated businesses. The strategy for unincorporated 
businesses was therefore examined. One hopes that the conclusions of this study will remain 
the same for incorporated businesses. 


¹ S. Michaud, Business Survey Methods Division, Statistics Canada, 11th floor, R.H. Coats Building, Tunney's Pasture, 
Ottawa, Ontario, Canada, K1A 0T6. 

2. DESCRIPTION OF THE SAMPLING PLAN 


As mentioned above, two independent samples are drawn from overlapping sample frames. 
The first is the prespecified sample selected for the Census of Construction; it is stratified 
by gross business income (GBI), province and 3-digit 1970 Standard Industrial Classifica- 
tion (SIC) code. The sample frame used is not completely up-to-date. It contains some 
‘‘deaths’’, i.e. businesses which are no longer within the scope of the COC for various reasons 
(a firm which no longer exists, is no longer engaged in a construction activity, or whose gross 
income is below $10,000). Furthermore, the sample frame does not contain ‘‘births’’ or 
businesses which have changed activities and are now part of the construction industry. The 
second sample is a ‘‘cross-sectional’’ sample, selected independently by Revenue Canada from 
a complete database containing businesses in all SIC groups (not only construction). It is 
used to estimate ‘‘births’’. This sample is stratified by Gross Business Income ranges. Figure 
1 below illustrates the situation. 

Two independent subsamples are selected from the units of the prespecified sample: a 
financial subsample and a subsample of "other characteristics" (OC). The OC subsample 
is drawn directly from the prespecified sample, while the financial subsample is selected us- 
ing data transcribed from the sample (and so ‘‘deaths’’ are not subsampled). Further details 
concerning the sampling plan may be found in Giles (1983). 


Figure 1. Representation of RC Sampling Plan (population groups: "deaths", "prespecified alive businesses", "births"; samples: prespecified sample, cross-sectional sample) 


3. IMPUTATION TECHNIQUE 


The COC uses a large-scale imputation technique to estimate the variables selected in a 
given subsample (i.e. values are imputed for each variable, for all records not selected in 
the subsamples). The imputation is carried out independently for each subsample. (The im- 
putation is done in phases, and the imputation phases of the various subsamples are mutual- 
ly independent and apply different techniques.) In each phase, the nearest neighbour is chosen 
from a subset of potential donor records, and is used to impute the variables which were 
not sampled. 

The imputation is carried out differently for each subsample. 

In the case of the financial subsample, the imputed value is the donor’s value, adjusted 
by the ratio of an auxiliary variable which is available for both the donor and the candidate 
(the candidate being the record which is missing data to be imputed). (Note: The actual pro- 
cedure is more complicated: the variables are imputed hierarchically and linear constraints 
are placed on the imputed values (the second variable is dependent on the value imputed 
to the first variable, etc.). Additional information on this procedure may be found in Philips 
and Emery (1976). A more detailed overview is also provided in Colledge et al. (1978)). 

Suppose we use the following notation: 

Y: the variable of interest (known for the donors, to be imputed for the candidate) 
X: an auxiliary variable available for both the donor and the candidate 
c: denotes the candidate 
d: denotes the donor 
I: denotes an imputed value. 
For the financial subsample variables, the imputed value $Y_c^I$ is defined to be: 

$$Y_c^I = Y_d \frac{X_c}{X_d}.$$

For the OC subsample variables, the imputed value is simply the value on the donor record: 

$$Y_c^I = Y_d.$$
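
The two imputation rules can be illustrated with a short Python sketch. It assumes, purely for illustration, that the nearest neighbour is the donor whose auxiliary value is closest to the candidate's and that donor auxiliary values are nonzero; the hierarchical imputation and linear constraints of the production system are not modelled, and the function name is hypothetical.

```python
def impute_from_nearest_donor(x_c, donors, ratio_adjust=True):
    """Impute Y for a candidate with auxiliary value x_c.

    donors       : list of (x_d, y_d) pairs with x_d != 0
    ratio_adjust : True for the financial subsample rule Y_d * X_c / X_d,
                   False for the OC subsample rule (donor value copied as is)
    """
    # Assumption: distance is measured on the auxiliary variable only.
    x_d, y_d = min(donors, key=lambda d: abs(d[0] - x_c))
    if ratio_adjust:
        return y_d * x_c / x_d
    return y_d
```

With donors `[(100, 20), (500, 50)]` and a candidate with `x_c = 120`, the ratio-adjusted rule gives `20 * 120 / 100 = 24`, while the unadjusted rule gives `20`.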


The imputation procedure produces a complete rectangular file (the records of all the 
businesses that were selected in one of the samples contain values for all the variables of 
the samples/subsamples). Sampling weights may then be used to generate estimates for the 
overall population. 

The weight assigned to a given record is the inverse of the probability of it being selected 
into at least one of the samples. If we use the following notation: 

$P(\mathrm{presp}_h)$ : the probability of a record being selected in stratum $h$ of the 
prespecified sample 

$P(\mathrm{cross}_k)$ : the probability of a record being selected in stratum $k$ of the cross- 
sectional sample 


$hk$ : cross-classification of records 
$h$ : denotes the stratum of the prespecified sample 
$k$ : denotes the stratum of the cross-sectional sample, 


then the weight associated with each unit, being the inverse of the probability of selection 
into at least one sample, may be expressed as: 

$$W_{hk} = \frac{1}{1 - [1 - P(\mathrm{presp}_h)][1 - P(\mathrm{cross}_k)]}.$$

Births and deaths cannot be cross-classified. Deaths have a zero weight, $W_h = 0$, and the 
weight of a birth, $W_k$, is the inverse of the probability of being selected in stratum $k$ of the 
cross-sectional sample. More details may be found in Bankier (1982). 
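
As a small numerical check of these weights, the following Python sketch computes the weight of a cross-classified unit and of a birth. The function names are illustrative only, and the two samples are treated as independent, as stated in Section 2.

```python
def combined_weight(p_presp, p_cross):
    """Inverse of the probability of selection into at least one of two independent samples."""
    p_any = 1.0 - (1.0 - p_presp) * (1.0 - p_cross)
    return 1.0 / p_any


def birth_weight(p_cross):
    """A birth can only be selected through the cross-sectional sample."""
    return 1.0 / p_cross
```

For example, with selection probabilities of 0.5 in the prespecified stratum and 0.2 in the cross-sectional stratum, the probability of selection into at least one sample is 0.6 and the weight is about 1.67; a birth in the same cross-sectional stratum has weight 5.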

Therefore, when the imputation technique is used, the estimator of the total is 

$$\hat{Y} = \sum_{hk} W_{hk} \sum_{i=1}^{n_{hk}} y_{ihk}^{*},$$

where $y_{ihk}^{*} = y_{ihk}$ if $i \in$ subsample 

$\qquad\quad\;\, = y_{ihk}^{I}$ if $i \notin$ subsample. 


4. WEIGHTING TECHNIQUE 


If a weighting technique were used to estimate subsample variables, there would be a 
number of possible estimators. The estimators are in the same form for both subsamples, 
but different weights are used. 

The first estimator ($\hat{Y}_1$) would be based on the sampling plan used, adjusted for 
undercoverage of the population. In each of the SIC, PROV and GBI strata (Standard Industrial 
Classification, province, gross business income), a prespecified sample is selected. Once they 
have been transcribed (units sampled and still alive), the units are classified to two strata: 
"outside survey field" and "within survey field". The subsamples are chosen from the "within 
survey field" stratum. (We may assume that all the units in the "outside survey field" stratum 
have been subsampled and have a mean equal to zero.) The estimator contains a correction 
factor that compensates for undercoverage of the sample frame (calculated using informa- 
tion from the cross-sectional sample). 

The second possible estimator ($\hat{Y}_2$) is a simplified version of the first estimator. 
Instead of assuming a double sampling to determine "within survey field" and "outside survey 
field" units, we could assume that a prespecified stratified sample is selected from "within 
survey field" units. A subsample is selected from the prespecified sample. The estimator 
must once again be adjusted to take undercoverage into account. If the differences between 
the first and second estimator turn out to be insignificant, the second would be a better choice 
because it is simpler. 

The third possible estimator ($\hat{Y}_3$) is an estimator based on data from the cross-sectional 
sample only. We could assume that the units selected in both the subsample and the cross- 
sectional sample are selected from the cross-sectional sample. The reasoning behind such 
an estimator is that the cross-sectional sample is drawn from a complete sample frame. 
However, since the subsamples are selected from the prespecified sample, and not from the 
cross-sectional sample, the size of the subsamples in the cross-sectional sample will be small. 

Finally, a fourth estimator ($\hat{Y}_4$) could be obtained by supposing that the subsample is 
selected from the complete sample (prespecified sample + cross-sectional sample), and that 
the complete sample comes from multiple frames. This fourth estimator is the one that most 
closely resembles the estimator obtained after large-scale imputation. Indeed, both of these 
estimators assume that births and new businesses ‘‘react’’ like the rest of the population. 
The imputation procedure does not make any special adjustment for such businesses, and 
the weighted estimator is not stratified in such a way as to distinguish these units. In addi- 
tion, both estimators take into account the fact that the sample comes from a number of 
frames. The same sampling weight is therefore used in both cases to produce data up to the 
population level. 
As mentioned above, the variables collected in the financial subsample are adjusted by 
the ratio of an auxiliary variable during the imputation. 

We could therefore propose another type of estimator for the variables collected in the 
financial subsample: a ratio estimator. The auxiliary variable used would be the same one 
used for the imputation. As is the case for the simple weighting, different estimators could 
be calculated. 

The various estimators and their variances are described in mathematical terms in the 
Appendix. 


5. RESULTS 


In the study, four of the seven variables in the financial subsample were considered. 

As for the subsample of other characteristics, eight variables are collected for all businesses, 
while other variables are available for certain SIC groups only. The study was therefore limited 
to these eight variables. 

The variables in the financial subsample presented in this report are "ADD" (additions 
to fixed assets) and "RM" (repair and maintenance). For the OC subsample, results are given 
for the variable "PCON" (percentage of construction in a specific field). However, the PCON 
variable is not published directly, but is multiplied by total expenses to obtain expenses in 
a specific field: PEXP. This second variable was the one studied. 

As mentioned earlier, the variables in the OC subsample are not adjusted by a ratio 
during the imputation procedure. The ratio estimators will therefore not apply to these 
variables. 

Tables 1, 2 and 3 provide values for the different estimators and estimates of their respec- 
tive variances, based on 1983 tax data for unincorporated businesses. 

In the first place, we see that there are no significant differences between the first two 
estimators. (According to the predetermined definitions, the second estimator is a simplified 
version of the first one.) The simplified version will therefore be retained. 


Table 1 


Estimated Values of PEXP (%EXP*EXPCONS) and Standard Deviation of PEXP 


$\hat{Y}_1$ $\hat{Y}_2$ $\hat{Y}_3$ $\hat{Y}_4$ $\hat{Y}_I$ 
Estimate (x 10!!) 3.44 3.43 3.96 3.66 2.40) 
Standard deviation (x 10°) 3.5 305 8.4 312 


Table 2 
Estimated Values of ADD and Standard Deviation of ADD 


$\hat{Y}_1$ $\hat{Y}_2$ $\hat{Y}_3$ $\hat{Y}_4$ $\hat{Y}_{Q2}$ $\hat{Y}_{Q3}$ $\hat{Y}_{Q4}$ $\hat{Y}_I$ 
Estimate (x 108) 2. OSsalWDilOr yr2.14 maidesagini 7y82 bon5306 512 1.4 
Standard deviation (x 107) 1.9 1.9 2.0 Ove 2058 op) 0.8 



Table 3 
Estimated Values of RM and Standard Deviation of RM 


Estimate (x 10 ay | be 1.43 £35 0.9 1.63 1.67 1.75 
Standard deviation ( x 10°) 6.9 6.9 8.9 5.3 sah 11.0 4.3 


In general, for the variables in the other characteristics subsample, the imputation technique appears 
to yield results similar to those produced by the weighting method ($\hat{Y}_4$). The estimator 
obtained by considering only units drawn from the cross-sectional sample ($\hat{Y}_3$) seems more 
variable than the other estimators. This variability could be explained by the smaller number 
of units used to calculate this estimator. It should be pointed out that these comparisons 
are based only on an observed sample, and so the conclusions are somewhat limited. However, 
owing to the nature of the data (often percentages and subdivisions of activity in the con- 
struction field), which is relatively stable in the strata (3-digit 1970 SIC, province and GBI), 
it was considered unnecessary to analyse these variables in greater depth. 

For the variables in the financial subsample, it was found that the estimators adjusted 
by the ratio do not always seem applicable (for example, the ADD variable). The estimates 
which they produce are extremely biased. One possible explanation is that the ADD variable 
and the auxiliary variable used have a high frequency of zero values. A ‘‘bad’’ sample in 
certain strata can thus inflate the estimates inordinately. 

Some problems were also encountered with the imputation system (data imputed when 
they should not have been, data not imputed), which in certain instances may have affected 
the estimates obtained by the imputation method. Since the results were based on an observ- 
ed sample only, and because it was difficult to estimate the impact of the system-related pro- 
blems, it was decided that a simulation would be done. 


6. SIMULATION 


The simulation was carried out using a data subset, namely those businesses that had been 
selected in the financial subsample (all of the variables studied are present for this data subset). 
Then an attempt was made to apply a simplified version of the technique used by the Census 
of Construction. A stratified sample was selected, using sampling rates similar to those of 
the survey. The variables of the financial subsample, for the data not selected in the sample, 
were considered as missing, and then imputed by the system. The sample selection process 
and the imputation were repeated thirty times. 

Estimates were produced, allowing us to compare the results obtained by summing the 
non-imputed and imputed data with the estimates produced using sampling weights equal 
to the inverse of the sampling rate. Since the value for the population is known, the bias 
and the variance of the estimates were calculated. The results for the ADD and RM variables 
are shown in Tables 4 and 5. 
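
The structure of such a simulation can be sketched as follows. This is a deliberately simplified illustration, not the procedure actually run: it uses a single stratum with simple random sampling, takes the nearest donor on the auxiliary variable, assumes strictly positive auxiliary values, and all names are hypothetical. The default of 30 replications follows the text.

```python
import random
import statistics


def simulate(population, n, reps=30, seed=1):
    """Compare weighting, ratio and imputation estimates of the total of y.

    population : list of (x, y) pairs with x > 0, all known (as in the simulation subset)
    n          : sample size drawn without replacement in each replication
    """
    rng = random.Random(seed)
    big_n = len(population)
    x_total = sum(x for x, _ in population)
    y_total = sum(y for _, y in population)
    estimates = {"weighting": [], "ratio": [], "imputation": []}

    for _ in range(reps):
        chosen = set(rng.sample(range(big_n), n))
        sample = [population[i] for i in chosen]
        sum_x = sum(x for x, _ in sample)
        sum_y = sum(y for _, y in sample)

        # Weighting: inverse of the sampling rate.
        estimates["weighting"].append(big_n / n * sum_y)
        # Ratio estimator using the known total of the auxiliary variable.
        estimates["ratio"].append(sum_y / sum_x * x_total)
        # Imputation: observed values plus ratio-adjusted donor values.
        total = sum_y
        for i in range(big_n):
            if i not in chosen:
                x_c = population[i][0]
                x_d, y_d = min(sample, key=lambda d: abs(d[0] - x_c))
                total += y_d * x_c / x_d
        estimates["imputation"].append(total)

    return {
        name: {
            "mean": statistics.mean(vals),
            "bias": statistics.mean(vals) - y_total,
            "sd": statistics.stdev(vals),
        }
        for name, vals in estimates.items()
    }
```

When $y$ is exactly proportional to $x$, the ratio and imputation estimates reproduce the population total in every replication, so only the weighting estimate varies; departures from proportionality (for example, many zero values) are what the comparison is meant to reveal.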

For the ADD variable, the value produced by ratio estimation differs significantly from 
the estimates obtained by imputation or by weighting. The bias of this estimate is also 
significantly different from zero. For the RM variable, all the estimators are equivalent (equal variances, 
bias not significant at a 5% level, estimates not significantly different). 
Table 4 
ADD Estimates Obtained by Simulation 


Population Weighting Ratio Imputation 
Estimate (x 107) 1.41 1.43 1.24 1.41 
Standard deviation (x 10°) Toul 85 1.15 
Bias (x 105) GS aie = (0.07 
Table 5 


RM Estimates Obtained by Simulation 


Population Weighting Ratio Imputation 
Estimate (x 107) 1.06 1.06 1.07 1.04 
Standard deviation (x 10°) 4.52 4.11 4.87 
Bias (x 105) = 0.07 — 0.95 =1.38 


7. CONCLUSIONS 


According to the study results, there do not appear to be significant differences between 
the large-scale imputation technique and the weighting technique, for the variables in the 
other characteristics subsample. This was foreseeable, inasmuch as the variables studied seem 
to be relatively stable within each stratum. 

The conclusions for the variables in the financial subsample are based on the results of 
the simulation. These seem to indicate that the estimates obtained by weighting by the in- 
verse of the probability of selection are comparable to the estimates obtained from large- 
scale imputation. 

The ratio estimator does not appear appropriate for the ADD variable (or for the other 
variables analysed, but not discussed in this report). Continuation of the study will try to 
determine whether a regression estimator would be more appropriate, and to evaluate the 
impact of the imputation on the variable correlation structure. 

APPENDIX 


The following notation may be used for the proposed estimators: 


$h$ : stratum of the prespecified sample 
$k$ : stratum of the cross-sectional sample 
$N_h$ : size of the "prespecified" population in stratum $h$ 
$N_{1h}$ : size of the "prespecified" population with "alive" businesses (within the 
scope of the survey) in stratum $h$ (estimated) 
$N_{2h}$ : size of the "prespecified" population with businesses "outside the scope 
of the survey" in stratum $h$ (estimated) 
$N_k$ : size of the population in stratum $k$, estimated using information from 
the cross-sectional sample 
$N_k'$ : size of the population in stratum $k$, estimated using information from 
both samples (multiple frames) 
$n_h$ : number of units sampled in stratum $h$ of the prespecified sample 
$\tilde{n}_h$ : number of units sampled and transcribed in stratum $h$ of the prespecified 
sample 
$\tilde{n}_k$ : number of units sampled and transcribed in stratum $k$ 
$m_h$ : number of units subsampled from among "alive" businesses in stratum $h$ 
$y$ : variable of one of the subsamples 
$x$ : auxiliary variable available for all units of the samples 
$s_{yh}^2$ : estimate of the variance of $y$ for the units of the subsample in stratum $h$ 
$s_{xh}^2$ : estimate of the variance of $x$ for the units of the subsample in 
stratum $h$ 
$s_{yxh}$ : estimate of the covariance of $x$ and $y$ in stratum $h$. 


i x N, pre-spec. a Noirths Np Ain mh 
i) iS = Silene Sp uy 
N, pre-spec h An Mp jz 
: Ni presspec + Npintns\2 NGS 
V( Y,) a ( shh rt ) > N, m( ) 
N, pre-spec h pa l 
] ] G), th 1 h 
x | si (— — a) eS At — = +. — Wip(l = Wn)? 
Yh h Nh Nh Yh Nh 
Nn — M, Mp Ayp, 
where G, = ( —— ). vr = ™ => and Wy, = —. 
Np — 1 Nip Np 


N, pre-spec. 


es A N, pre-spec. + Noirths Nip Mh 
gion gra estes nlinaoyren Sacre 
h 
Ratio estimators may be calculated and, like simple estimators, they may take on different 
forms, depending on the hypotheses postulated. For example, the ratio estimator 
corresponding to estimator 4 would be: 

$$\hat{Y}_{Q4} = \sum_k N_k' \, \bar{x}_{\mathrm{samp},k} \, \frac{\bar{y}_{\mathrm{sub},k}}{\bar{x}_{\mathrm{sub},k}},$$

where $\bar{x}_{\mathrm{samp},k}$ is the mean of variable $X$ for the units selected in the complete sample, 
which are in stratum $k$ 

$\bar{x}_{\mathrm{sub},k}$ is the mean of $X$ for the units selected in the subsample, which are in 
stratum $k$ 

$\bar{y}_{\mathrm{sub},k}$ is the mean of variable $Y$ in stratum $k$ of the subsample. 


F 2) 1 1 ; E 1 1 
V(Yoa) = (N;)? ma) |e VUE en Pa carer —- —}s? 
vs k k k ee k 
k 


Miz Aix ny; IN; 


where R;, = 
A Regression Approach to Estimation 
in the Presence of Nonresponse 


CARL ERIK SARNDAL¹ 


ABSTRACT 


In the presence of unit nonresponse, two types of variables can sometimes be observed for units in 
the "intended" sample $s$, namely, (a) variables used to estimate the response mechanism (the response 
probabilities), (b) variables (here called co-variates) that explain the variable of interest, in the usual 
regression theory sense. This paper, based on Sarndal and Swensson (1985 a, b), discusses nonresponse 
adjusted estimators with and without explicit involvement of co-variates. We conclude that the presence 
of strong co-variates in an estimator induces several favourable properties. Among other things, 
estimators making use of co-variates are considerably more resistant to nonresponse bias. We discuss 
the calculation of standard error and valid confidence intervals for estimators involving co-variates. 
The structure of the standard error is examined and discussed. 


KEY WORDS: Response mechanism; Adjustment group method; Co-variate; Robustness. 


1. INTRODUCTION 


We consider a finite population $U = \{1, ..., k, ..., N\}$ from which a sample $s$ of size 
$n$ is drawn with a sampling design under which the $k$-th unit has the (strictly positive) 
probability $\pi_k$ of being selected. The sampling weight associated with the $k$-th unit is thus $\pi_k^{-1}$. 
We may admit a complex sampling design, not necessarily self-weighting, for example, a 
three-stage design with stratified selection of primary units. The probability under the design 
of jointly including the units $k$ and $l$ is denoted $\pi_{kl}$ ($\pi_{kl} > 0$ for all $k \ne l$, and $\pi_{kk}$ is 
interpreted as equal to $\pi_k$). 


Given $s$, a certain unit nonresponse is assumed to occur. The responding subset of $s$ is 
denoted by $r$, its size by $m$. The variable of interest, $y$, is observed for $k \in r$ only. To counteract 
the biasing effects of the nonresponse, we assume for the purpose of this paper that the widely 
used adjustment group method is employed: the sample $s$ is subdivided into $H$ groups 
$s_1, ..., s_h, ..., s_H$ of respective sizes $n_1, ..., n_h, ..., n_H$. The response set $r$ is correspondingly 
divided into the subsets $r_1, ..., r_h, ..., r_H$, of respective sizes $m_1, ..., m_h, ..., m_H$. The 
response rate in group $h$ is denoted $f_h = m_h/n_h$. The method calls for attaching (in 
addition to the sampling weight) the "adjustment weight" $f_h^{-1}$ to an observation coming from 
group $h$. (The sizes and the composition of the adjustment groups at the population level 
are here assumed unknown.) We have: 

$$n = \sum_{h=1}^{H} n_h, \qquad m = \sum_{h=1}^{H} m_h.$$


¹ Carl Erik Sarndal, Department of Mathematics and Statistics, University of Montreal, Montreal, Quebec, Canada, 

Let $t = \sum_U y_k$ be the unknown population total to be estimated. (If $A$ is an arbitrary set 
of units, we shall systematically write $\sum_A y_k$ for $\sum_{k \in A} y_k$.) The usual adjustment class 
estimator of $t$ then becomes 

$$\hat{t} = \sum_{h=1}^{H} f_h^{-1} \sum_{r_h} \frac{y_k}{\pi_k}. \qquad (1.1)$$


The adjustment group method is motivated theoretically by an assumption that units within 
the same group respond with the same (unknown) response probability. (More formally, this 
is expressed as Model A in Section 3 below.) The method clearly requires that group identity 
can be determined for each unit $k \in s$. The (categorical) variables that permit this grouping 
can thus be regarded as variables used for the estimation of an underlying response mechanism. 

A different category of variable may be observable for each $k \in s$, namely, variables that 
explain $y$, in the ordinary regression theory sense. These variables will be termed co-variates. 
When incorporated in the estimator, such variables will not only reduce variance but also 
make the estimator more resistant to nonresponse bias. (They are not auxiliary variables in 
the usual sense of this term, since they are available not for the entire population U but only 
for the intended sample s.) 

We shall thus keep a firm distinction in this paper between two types of variables observed 
for $k \in s$, those that are used to estimate the response mechanism, and those that explain 
the target variable y. Little (1983), in presenting a general framework for data with 
nonresponse, distinguishes several types of variables. One attempt to describe our situation 
in terms of Little’s setup would be to say that the set of complete item variables in Little’s 
terminology are, in our case, further subdivided into one subset of variables used to model 
the nonresponse mechanism, and another subset (the co-variates) serving as explanatory 
variables for the incomplete item variable y. Our approach to inference is that of ‘‘quasi- 
randomization’’ (Oh and Scheuren 1983), where ‘‘quasi’’ refers to the fact that the non- 
response selection phase must be modelled, whereas the sample selection phase is controlled 
by the sampler. 


2. SOME SIMPLE NONRESPONSE ADJUSTED ESTIMATORS 
OF THE POPULATION TOTAL 


A slight development of the often seen formula (1.1) leads to a (generally somewhat
"better") alternative in which the sampling weights $\pi_k^{-1}$ can be said to be more fully used:


The formula (which becomes identical to (1.1) for a self-weighting design) can be written 
as an expansion of the response set mean: 


\[ \hat{t}_{EXP} = \hat{N} \tilde{y}_r, \]

namely, if we let the expansion factor be $\hat{N} = \sum_s 1/\pi_k$, and

\[ \tilde{y}_r = \frac{\sum_{h=1}^{H} f_h^{-1} \sum_{r_h} y_k/\pi_k}{\sum_{h=1}^{H} f_h^{-1} \sum_{r_h} 1/\pi_k}. \tag{2.1} \]


The symbol tilde will be used to indicate a properly weighted mean statistic. The "tilde mean"
$\tilde{y}_r$, being a response set mean, is calculated by attaching to the $k$-th unit the multiplicative
weight

sample weight $\times$ nonresponse adjustment weight $= \pi_k^{-1} f_h^{-1}$

for each unit $k$ in the $h$-th adjustment group.

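The expansion estimator $\hat{t}_{EXP}$ is appropriate for the nonresponse situation: it takes into
account the sampling design and it makes an effort to adjust for nonresponse. However,
$\hat{t}_{EXP}$ can be improved upon if more information is at hand. Suppose that a single (and
always positive) co-variate $x$ is also observed for $k \in s$. In the image of the classical ratio
estimator, we can then construct

\[ \hat{t}_{RA} = \hat{N} \tilde{x}_s \frac{\tilde{y}_r}{\tilde{x}_r}, \]

where $\tilde{x}_r$ is the response set mean of $x$, weighted in the same way as $\tilde{y}_r$ in (2.1), and
$\tilde{x}_s = (\sum_s x_k/\pi_k)/(\sum_s 1/\pi_k)$.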


The tilde mean $\tilde{x}_s$, being formed at the level of the intended sample $s$, employs sample
weights only. (This type of mean can be calculated for the $x$-variable, which is observed for
all $k \in s$, but obviously not for the $y$-variable, which is observed for $k \in r$ only.)

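The classical regression estimator formula corresponds, in our context, to

\[ \hat{t}_{REG} = \hat{N}\{\tilde{y}_r + b(\tilde{x}_s - \tilde{x}_r)\} \]

with

\[ b = \frac{\sum_{h=1}^{H} f_h^{-1} \sum_{r_h} (x_k - \tilde{x}_r)(y_k - \tilde{y}_r)/\pi_k}{\sum_{h=1}^{H} f_h^{-1} \sum_{r_h} (x_k - \tilde{x}_r)^2/\pi_k}. \]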

(Note: sample weighting as well as nonresponse weighting is used in $b$ too.)

In summary, we have a series of three estimators

\[ \hat{t}_{EXP} = \hat{N}\tilde{y}_r; \tag{2.2a} \]

\[ \hat{t}_{RA} = \hat{N}\tilde{x}_s \frac{\tilde{y}_r}{\tilde{x}_r}; \tag{2.2b} \]

\[ \hat{t}_{REG} = \hat{N}\{\tilde{y}_r + b(\tilde{x}_s - \tilde{x}_r)\}. \tag{2.2c} \]


All three are properly sample weighted and nonresponse weighted. The obvious differences
have to do with the co-variate: $\hat{t}_{EXP}$ uses no co-variate, whereas $\hat{t}_{RA}$ and $\hat{t}_{REG}$ do. It is also
clear that $\hat{t}_{RA}$ appeals to an underlying relationship between $y$ and the co-variate $x$ in the
form of a line through the origin, the slope of which is estimated by $\tilde{y}_r/\tilde{x}_r$. In the case of
$\hat{t}_{REG}$, the relationship is a regression with a non-zero intercept. We shall further explore the
role of the co-variate.
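
As an illustration only, the three estimators (2.2a) to (2.2c) can be sketched in a few lines of Python. The function name `adjusted_totals` and its argument layout are assumptions made for this sketch, not part of the paper; the weights follow the rule stated above (sample weight times nonresponse adjustment weight), with the group response rate taken as the number of respondents divided by the group size.

```python
from collections import Counter

def adjusted_totals(y, x, pi, group, responded):
    """Return (t_exp, t_ra, t_reg) for one realized sample s.

    y[k] is read only for responding units; x, pi and group are
    needed for every unit of the intended sample (hypothetical layout).
    """
    n_h = Counter(group)                                      # group sizes in s
    m_h = Counter(g for g, r in zip(group, responded) if r)   # respondents per group

    n_hat = sum(1.0 / p for p in pi)                          # expansion factor N-hat
    x_s = sum(xk / p for xk, p in zip(x, pi)) / n_hat         # tilde mean of x over s

    # weight of a respondent: (1/pi_k) * (1/f_h), with f_h = m_h / n_h
    resp = [(yk, xk, n_h[g] / (m_h[g] * p))
            for yk, xk, p, g, r in zip(y, x, pi, group, responded) if r]
    w_sum = sum(w for _, _, w in resp)
    y_r = sum(w * yk for yk, _, w in resp) / w_sum            # tilde mean of y over r
    x_r = sum(w * xk for _, xk, w in resp) / w_sum            # tilde mean of x over r
    b = (sum(w * (xk - x_r) * (yk - y_r) for yk, xk, w in resp)
         / sum(w * (xk - x_r) ** 2 for _, xk, w in resp))     # weighted slope

    t_exp = n_hat * y_r
    t_ra = n_hat * x_s * y_r / x_r
    t_reg = n_hat * (y_r + b * (x_s - x_r))
    return t_exp, t_ra, t_reg
```

When $y$ is exactly proportional to $x$, the ratio and regression versions coincide, which is a convenient sanity check on the weighting.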

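If the population size $N$ is known, it is in general better to replace $\hat{N}$ by $N$ in (2.2a) to
(2.2c), yielding

\[ \hat{t}^*_{EXP} = N\tilde{y}_r; \tag{2.3a} \]

\[ \hat{t}^*_{RA} = N\tilde{x}_s \frac{\tilde{y}_r}{\tilde{x}_r}; \tag{2.3b} \]

\[ \hat{t}^*_{REG} = N\{\tilde{y}_r + b(\tilde{x}_s - \tilde{x}_r)\}. \tag{2.3c} \]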


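For estimating the population total, $N$ must be known in these three estimators, which
may not be the case. However, for estimating the population mean $\bar{Y}$, they lead, by dividing
by $N$, to the convenient expressions

\[ \hat{\bar{Y}}_{EXP} = \tilde{y}_r; \tag{2.4a} \]

\[ \hat{\bar{Y}}_{RA} = \tilde{x}_s \frac{\tilde{y}_r}{\tilde{x}_r}; \tag{2.4b} \]

\[ \hat{\bar{Y}}_{REG} = \tilde{y}_r + b(\tilde{x}_s - \tilde{x}_r). \tag{2.4c} \]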


The three series of estimators (2.2), (2.3), and (2.4) are easy to accept on intuitive grounds 
since all that is involved are elementary weighting principles, plus standard ratio feature or 
regression feature. Somewhat less elementary is to draw the proper consequences for variance 
estimation and the construction of valid confidence intervals. These questions are discussed 
in Section 4. (Contrary to what the rather informal presentation of the estimators (2.2) to 
(2.4) may suggest, the formulas are not ‘‘ad hoc’’ but the result of a formalized general estima- 
tion procedure (with a multivariate regression) for two phases of selection; see Sarndal and 
Swensson (1985a). Most importantly, the variance estimators and confidence intervals follow 
directly from this theory.) 


3. RESPONSE MODELS


The nonresponse weights in the estimators seen in Section 2 can be justified through a 
response mechanism model involving individual response probabilities that are constant for 
each unit in a given group. More formally, consider the response mechanism: 


MODEL A: 


(1) The probability of response is constant (and equal to an unknown constant $\theta_h$) for
all units $k \in s_h$; $h = 1, ..., H$.
(2) The units respond independently of each other. 


The theoretical response probabilities $\theta_h$ may vary considerably between groups. (An
indication that large differences in response propensity may exist between different subsets
is, of course, an incentive to set up adjustment groups, and to weight accordingly.)

Consider a fixed sample realization, $s$. The group frequencies $n_1, ..., n_h, ..., n_H$ are then
fixed. Let us also consider a fixed value of the vector of group response frequencies
$\mathbf{m} = (m_1, ..., m_h, ..., m_H)$. With $s$ and $\mathbf{m}$ fixed, the "selection" under Model A of a
response set $r_h$ can be shown to conform to a simple random selection of $m_h$ from $n_h$. The
conditional response probability of a unit $k$ in the $h$-th group is therefore

\[ \pi_{k|s,m} = \frac{m_h}{n_h} = f_h \quad \text{for all } k \in s_h. \tag{3.1} \]

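(This consideration underlies the weight $f_h^{-1}$ used in the estimators.) Similarly one can show
that given $s$ and $\mathbf{m}$, the probability under Model A that units $k$ and $l$ both respond is

\[ \pi_{kl|s,m} = \begin{cases} f_h & \text{if } k = l \in s_h \\ f_h \dfrac{m_h - 1}{n_h - 1} & \text{if } k \neq l \in s_h \\ f_h f_{h'} & \text{if } k \in s_h,\ l \in s_{h'}\ (h \neq h') \end{cases} \tag{3.2} \]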

($\pi_{kk|s,m}$ is by definition equal to $\pi_{k|s,m}$.) These quantities (which remind us of stratified
random sampling with $m_h$ units chosen from $n_h$ in the $h$-th stratum) are important for the
calculation of variance estimates and standard errors; see below.
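
The three cases of (3.2) translate directly into a small lookup. The sketch below is illustrative only; the function name `pair_response_prob` and the dictionary arguments `n_h` and `m_h` (group sizes and respondent counts keyed by group label) are assumptions of this example.

```python
def pair_response_prob(h_k, h_l, same_unit, n_h, m_h):
    """Conditional probability that units k and l both respond, given s and m.

    h_k, h_l  : adjustment group labels of units k and l
    same_unit : True when k and l are the same unit
    n_h, m_h  : dicts of group sample sizes and respondent counts
    """
    f_k = m_h[h_k] / n_h[h_k]
    if same_unit:
        return f_k                                   # first-order probability (3.1)
    if h_k == h_l:
        # two distinct units drawn without replacement from the same group
        return f_k * (m_h[h_k] - 1) / (n_h[h_k] - 1)
    return f_k * (m_h[h_l] / n_h[h_l])               # independent across groups
```

For two groups with $(n_h, m_h) = (10, 6)$ and $(5, 4)$, the three cases give $0.6$, $0.6 \times 5/9$ and $0.6 \times 0.8$.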

In practice, the analyst decides how to set up his groups $s_h$. The decision is crucial, for
it will determine the adjustment weights $f_h^{-1}$, and thus the numerical value of the estimate
of $t$, the variance estimate, and the confidence interval. Two different groupings may lead
to widely different point estimates and confidence intervals.

The analyst is not so naive as to think that response probabilities exist that are exactly
equal within the groups that he has identified. He does, however, believe (and usually with
good reason) that more valid point estimates and confidence intervals will result with these
groups (and thereby the weights $f_h^{-1}$) than without them. The adjustment group approach
is a sound and firmly established practice.

On closer scrutiny, several things may be wrong with a response model such as Model 
A: the response probability is perhaps not constant within groups. And, even if it were, the 
particular groups postulated by the model are perhaps wrongly defined; there should have 
been more groups than assumed, etc. Two cases must therefore be distinguished for the con- 
tinued discussion: 

(a) The assumed response mechanism (ARM; here in the form of Model A) is true. In
practice, this is unlikely to be exactly the case. 

(b) The ARM is more or less false. This is the unpleasant truth in the majority of all prac- 
tical situations, and it leads to nonresponse bias. In the case of Model A, the groups 
may be formed more or less incorrectly. 

As is usual in statistics, the statistical analyst will formulate the model corresponding to 
the best of his judgement; accordingly, he will draw certain inferences (confidence statements, 
for example). Then he will wonder about the robustness of these conclusions, that is, how 
well do they hold up if the model is false? In the same order of things, let us consider these 
questions in our particular situation. 


4. VARIANCE ESTIMATORS BASED ON A CERTAIN 
ASSUMED RESPONSE MECHANISM 


Model A, with a specified set of groups, is assumed to hold. The response rates, $f_h =
m_h/n_h$, $h = 1, ..., H$, have been established. With this as a starting point, let us examine
the variance estimators needed to construct a confidence interval at a specified $100(1 - \alpha)\%$
level. If $\hat{t}$ is one of the estimators in Section 2, and Model A really holds, we have:


(a) $\hat{t}$ is unbiased (except for a usually unimportant technical bias);
(b) an approximately $100(1 - \alpha)\%$ confidence interval for $t$ is

\[ \hat{t} \pm z_{1-\alpha/2}\sqrt{\hat{V}(\hat{t})}, \]

where the constant $z_{1-\alpha/2}$ is exceeded with probability $\alpha/2$ by the unit normal variate.
Under repeated draws of samples $s$ and, for each fixed $s$, repeated realizations (obeying
the assumed Model A) of response sets $r$, the interval will contain the true population total
approximately $100(1 - \alpha)\%$ of the time.
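
A minimal sketch of the interval computation, assuming a point estimate and a total variance estimate are already available (the function name `confidence_interval` is hypothetical; the normal quantile comes from the Python standard library):

```python
from statistics import NormalDist

def confidence_interval(t_hat, v_hat, alpha=0.05):
    """Normal-theory interval t_hat +/- z_{1-alpha/2} * sqrt(v_hat)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # e.g. about 1.96 for alpha = 0.05
    half_width = z * v_hat ** 0.5
    return t_hat - half_width, t_hat + half_width
```

The interval is only as trustworthy as the variance estimate fed into it, which is why the response model behind that estimate matters.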
The variance and the estimated variance will be determined by two sets of selection
probabilities:
1. $\pi_k$ and $\pi_{kl}$, the probabilities of inclusion (first and second order) that accompany the
sampling phase;
2. $\pi_{k|s,m}$ and $\pi_{kl|s,m}$, the conditional response probabilities (first and second order) associated
with the response Model A ("the nonresponse phase").


In our case, as a consequence of Model A, $\pi_{k|s,m}$ and $\pi_{kl|s,m}$ are given, respectively,
by (3.1) and (3.2). As for $\pi_k$ and $\pi_{kl}$, full generality is assumed; any design may be used
for the sampling phase.

A detailed analysis will show that the total variance of any one of the estimators $\hat{t}$ seen
in Section 2 can be broken down into two components:

\[ V(\hat{t}) = V_1(\hat{t}) + V_2(\hat{t}), \]

where $V_1(\hat{t})$ may be termed the sampling variance and $V_2(\hat{t})$ the nonresponse variance. The
exact formulas given in Sarndal and Swensson (1985a) are not reproduced here, but one notes
that the components have some reasonable properties:

1. $V_1(\hat{t}) = 0$ if the whole population $U$ is observed (a census rather than a sample
survey);
2. $V_2(\hat{t}) = 0$ if the response is complete ($r = s$);
3. $V_2(\hat{t})$ is greatly reduced in the presence of a strong co-variate, but $V_1(\hat{t})$ is not
affected by the co-variate (naturally enough, since it is observed for $k \in s$ only).


Let us examine somewhat more closely the variance estimators. If $\hat{V}_i(\hat{t})$ denotes the
estimator of $V_i(\hat{t})$, $i = 1, 2$, the total variance $V(\hat{t})$ will be estimated by an expression of
the form

\[ \hat{V}(\hat{t}) = \hat{V}_1(\hat{t}) + \hat{V}_2(\hat{t}). \]


Here, the estimated sampling variance component is

\[ \hat{V}_1(\hat{t}) = \sum_{k \in r}\sum_{l \in r} \left(\frac{1}{\pi_k \pi_l} - \frac{1}{\pi_{kl}}\right) \frac{u_k u_l}{\pi_{kl|s,m}}, \]

where $\pi_{kl|s,m}$ is given by (3.2), and $\pi_k$, $\pi_{kl}$ are the inclusion probabilities of the sampling
design. The estimated nonresponse variance component is


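\[ \hat{V}_2(\hat{t}) = \sum_{h=1}^{H} n_h^2 \left(\frac{1}{m_h} - \frac{1}{n_h}\right) S^2_{w r_h} \]

with

\[ S^2_{w r_h} = \frac{1}{m_h - 1} \sum_{r_h} (w_k - \bar{w}_{r_h})^2, \]

where $\bar{w}_{r_h}$ is the mean of the $w_k$ over the response set $r_h$.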

The quantities $u_k$ and $w_k$ differ from one estimator $\hat{t}$ to another. Let us look first at the
estimated nonresponse variance, $\hat{V}_2(\hat{t})$. This component is of "stratified form": the factor
$n_h^2(1/m_h - 1/n_h)$ is characteristic of a stratified simple random selection with $m_h$ units
chosen from $n_h$ in the $h$-th stratum. The reason for this structure lies in the conditional
response probabilities $\pi_{kl|s,m}$ given by (3.2).

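The quantities $w_k$ have the following appearance:

For $\hat{t}_{EXP}$ and $\hat{t}^*_{EXP}$:
\[ w_k = \frac{y_k - \tilde{y}_r}{\pi_k}; \]

For $\hat{t}_{RA}$ and $\hat{t}^*_{RA}$:
\[ w_k = \frac{y_k - (\tilde{y}_r/\tilde{x}_r)x_k}{\pi_k}; \]

For $\hat{t}_{REG}$ and $\hat{t}^*_{REG}$:
\[ w_k = \frac{y_k - \tilde{y}_r - b(x_k - \tilde{x}_r)}{\pi_k}. \]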


The expressions for $w_k$ are sample weighted regression residuals. Consequently, if $x_k$ is a
powerful explanatory variable for $y_k$, one will ordinarily have that the variance of the $w_k$
(and thus $\hat{V}_2(\hat{t})$) is smaller for the RA and REG estimators than for the EXP estimator,
where the quantity $w_k$ is just a deviation of $y_k$ from the response set mean $\tilde{y}_r$. Consequently,
in fortunate circumstances, the part of the standard error that is due to the nonresponse
will be reduced to near-zero levels, namely, when $x$ and $y$ have near perfect correlation.
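
The stratified form of the nonresponse component can be sketched as follows. This is a hedged illustration: it assumes the factor $n_h^2(1/m_h - 1/n_h)$ and a within-group variance with divisor $m_h - 1$, and the function name `nonresponse_variance` and its arguments are hypothetical. The residuals $w_k$ are supplied by the caller, so the same function serves the EXP, RA and REG cases.

```python
from collections import defaultdict

def nonresponse_variance(w, group, n_h):
    """Stratified-form estimate: sum_h n_h^2 (1/m_h - 1/n_h) S^2_{w,h}.

    w     : residuals w_k for the responding units
    group : adjustment group label of each responding unit
    n_h   : dict of group sizes in the intended sample s
    """
    by_group = defaultdict(list)
    for wk, g in zip(w, group):
        by_group[g].append(wk)

    total = 0.0
    for g, values in by_group.items():
        m = len(values)
        if m < 2:
            raise ValueError("need at least two respondents per group")
        mean = sum(values) / m
        s2 = sum((v - mean) ** 2 for v in values) / (m - 1)   # within-group variance
        total += n_h[g] ** 2 * (1.0 / m - 1.0 / n_h[g]) * s2
    return total
```

A group with complete response ($m_h = n_h$) contributes nothing, and residuals that barely vary within groups (a strong co-variate) drive the whole component toward zero.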

The estimated sampling variance component $\hat{V}_1(\hat{t})$ is of less interest in this discussion,
since it is not directly influenced by the co-variate. It should be mentioned, however, that the
$u_k$ are determined as follows: for $\hat{t}_{EXP}$, $\hat{t}_{RA}$, and $\hat{t}_{REG}$, $u_k = y_k$, while for the "starred" series
of estimators $\hat{t}^*_{EXP}$, $\hat{t}^*_{RA}$, and $\hat{t}^*_{REG}$, $u_k = y_k - \tilde{\hat{y}}_s$, where $\tilde{\hat{y}}_s = (\sum_s \hat{y}_k/\pi_k)/(\sum_s 1/\pi_k)$ is
the mean of the predicted values from the regression fit, so that for $\hat{t}^*_{EXP}$, $\hat{y}_k = \tilde{y}_r$ for all
$k$; for $\hat{t}^*_{RA}$, $\hat{y}_k = (\tilde{y}_r/\tilde{x}_r)x_k$; and for $\hat{t}^*_{REG}$, $\hat{y}_k = \tilde{y}_r + b(x_k - \tilde{x}_r)$.

A special case arises when $m_h = n_h$ for all $h$ (that is, no nonresponse). Then $\hat{V}_2(\hat{t}) = 0$
(as is reasonable), and $\pi_{kl|s,m} = 1$ for all $k$ and $l$, leaving the non-zero component

\[ \hat{V}_1(\hat{t}) = \sum_{k \in r}\sum_{l \in r} \left(\frac{1}{\pi_k \pi_l} - \frac{1}{\pi_{kl}}\right) u_k u_l, \]

which is the well-known variance estimator for the case of full response.


5. ROBUSTNESS PROPERTIES WHEN THE ASSUMED 
RESPONSE MECHANISM IS FALSE 


Unbiased estimates and valid confidence intervals can be obtained with the aforementioned 
estimators, provided the ARM (given by Model A) holds. The presence of a strong co-variate 
brings about a reduction of the nonresponse component of the variance. 

More interesting in a real-life situation is the case where the ARM breaks down. This case 
must be considered, because even the most careful judgement in setting up adjustment groups 
is bound to be less than perfect. The extent of the departure of the true response behaviour 
from that of the ARM will now determine behaviour of the various estimators. The statistical 
properties (bias, coverage rate achieved by confidence intervals, etc.) are in other words func- 
tions of the extent of model breakdown. 

In Sarndal and Swensson (1985a), a small scale Monte Carlo experiment was carried out 
to study the impact of certain types of breakdown in Model A. For purposes of illustration, 
we cite a few results from this study. 

The true ARM in the experiment had H = 4 adjustment groups, with different response 
probabilities between groups (but constant response probability for all units in the same 
group). 1,000 simple random samples were drawn, and each sample was exposed to simulated 
nonresponse according to the true ARM (which is taken as known, since this is a controlled 
experiment). 

As expected from theory, when the ARM underlying $\hat{t}_{EXP}$ and $\hat{t}_{RA}$ is true, there is
essentially no bias, and the empirical coverage rates of the confidence intervals agree
essentially with the nominal 95% rate. The advantage of $\hat{t}_{RA}$ lies in a smaller component of
variance due to nonresponse. (See "ARM is true" in Table 1.)

False ARM's were created by joining together groups of the true ARM. The estimator
and the confidence interval (based on the false ARM) will then be calculated on the basis
of fewer groups than ought to be the case. The case "ARM is false" in Table 1 represents
the extreme situation where all four groups of the true ARM were joined into one, meaning
that one acts in the estimation process as if all units throughout the population had the same
(unknown, but estimated) response probability. The table shows that the co-variate estimator,
$\hat{t}_{RA}$, when compared to the no-co-variate estimator, $\hat{t}_{EXP}$, has the following (not unexpected)
advantages: (a) strong resistance to nonresponse bias (1.26 versus 4.85); (b) much better
preservation of the nominal 95% confidence coefficient (92.6% versus 46.3% empirical coverage
rate). In addition, $\hat{t}_{RA}$ has a variance advantage, and therefore shorter confidence intervals
on the average.
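
The qualitative effect can be reproduced with a toy simulation. The sketch below is not the experiment of Sarndal and Swensson (1985a): the population, the response probabilities (0.9, 0.7, 0.5, 0.3), the sample size and the function name `simulate_collapsed_groups` are all invented for illustration. It uses simple random sampling, collapses the four true groups into one (the "ARM is false" case), and compares the bias of the mean estimators (2.4a) and (2.4b).

```python
import random

def simulate_collapsed_groups(reps=300, n=200, seed=1):
    """Monte Carlo bias of the EXP and RA mean estimators under a false one-group ARM."""
    rng = random.Random(seed)
    probs = [0.9, 0.7, 0.5, 0.3]            # true response probability per group
    pop = []
    for h, p in enumerate(probs):
        for _ in range(500):
            x = 10.0 * (h + 1) + rng.uniform(-3.0, 3.0)   # co-variate tied to the group
            y = 2.0 * x + rng.gauss(0.0, 1.0)             # y nearly proportional to x
            pop.append((x, y, p))
    true_mean = sum(y for _, y, _ in pop) / len(pop)

    err_exp = err_ra = 0.0
    for _ in range(reps):
        s = rng.sample(pop, n)                               # simple random sample
        r = [(x, y) for x, y, p in s if rng.random() < p]    # simulated response
        x_s = sum(x for x, _, _ in s) / n
        x_r = sum(x for x, _ in r) / len(r)
        y_r = sum(y for _, y in r) / len(r)
        err_exp += y_r - true_mean                # one-group expansion estimator
        err_ra += x_s * y_r / x_r - true_mean     # one-group ratio estimator
    return err_exp / reps, err_ra / reps
```

Because low-$x$ groups respond more often in this invented setup, the respondent mean is pulled downward, while the ratio form corrects most of that through the known sample mean of $x$.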

Table 1

Comparison of $\hat{t}_{EXP}$ and $\hat{t}_{RA}$

Case            Estimator    Absolute    Nonresponse variance    Empirical coverage rate
                             bias        component V_2           (95% nominal)

ARM is true     t_EXP         0.00       1.99                    95.2%
                t_RA         -0.01       0.78                    95.5%
ARM is false    t_EXP         4.85       2.55                    46.3%
                t_RA          1.26       0.78                    92.6%


6. CONCLUSION 


In summary, we have argued in this paper that two different categories of variables (observed
for $k$ in the intended sample $s$) are of importance:


(a) variables suitable for estimating the response mechanism (in the case of Model A, these 
variables allow the construction of the adjustment groups); 

(b) variables (here called co-variates) that are powerful predictors of the y-variable; when 
used in the estimator formula, they reduce variance and improve the robustness pro- 
perties. 


Whenever possible, one should thus be on the lookout for suitable co-variates. One should
also note that when several $y$-totals are to be estimated, the appropriate co-variates may differ
from one $y$-variable to the other, whereas the weighting classes would probably be set
up to apply uniformly for all variables of interest.


Ratio Estimation with Subsampling the Nonrespondents


PODURI S.R.S. RAO! 


ABSTRACT 


The procedure of subsampling the nonrespondents suggested by Hansen and Hurwitz (1946) is con- 
sidered. Post-stratification prior to the subsampling is examined. For the mean of a characteristic of 
interest, ratio estimators suitable for different practical situations are proposed and their merits are 
examined. Suitable ratio estimators are also suggested for the situations in which the Hard-Core are 
present. 


KEY WORDS: Auxiliary information; Post-stratification; Biases; Mean square errors; Linear model; 
Hard-Core. 


1. INTRODUCTION 


Consider a finite population of size $N$ and a random sample of size $n$ drawn without
replacement. In surveys on human populations, frequently $n_1$ units respond on the items under
examination, but the remaining $(n - n_1)$ units do not provide any response. The initial survey
may be conducted through the mail or telephone calls, perhaps computer-aided.

In Sections 2, 3 and 4, we consider Hansen and Hurwitz's (1946) procedure of subsampling
a portion of the $(n - n_1)$ nonrespondents. In this procedure the population is supposed
to consist of the response stratum of size $N_1$ and the nonresponse stratum of size
$N_2 = (N - N_1)$.

In Section 2, we discuss two procedures for post-stratifying the sampled units, prior to 
the subsampling of the nonrespondents. 

Two ratio estimators for the mean of an item are considered in Section 3. Biases and Mean 
Square errors of these estimators are compared in Sections 3 and 4. In Section 4, two more 
ratio estimators, which may be suitable for some practical situations, are proposed and their 
relative merits are examined. 

The Hard-Core problem is considered in Section 5. Six different estimators for this situa- 
tion are proposed. Optimum conditions suitable for each one of the estimators are briefly 
described. 


2. HANSEN AND HURWITZ’S ESTIMATOR AND 
POST-STRATIFICATION 


Consider a characteristic of interest $y_i$, $i = (1, 2, ..., N)$. Let $\bar{Y} = (\sum_1^N y_i)/N$ and
$S^2 = \sum_1^N (y_i - \bar{Y})^2/(N - 1)$ denote the mean and variance of the population. Let $\bar{Y}_1 =
(\sum_1^{N_1} y_i)/N_1$ and $S_1^2 = \sum_1^{N_1} (y_i - \bar{Y}_1)^2/(N_1 - 1)$ denote the mean and variance of the
response group. Similarly, let $\bar{Y}_2 = (\sum_1^{N_2} y_i)/N_2$ and $S_2^2 = \sum_1^{N_2} (y_i - \bar{Y}_2)^2/(N_2 - 1)$
denote the mean and variance of the nonresponse group. The population
mean can be written as $\bar{Y} = W_1\bar{Y}_1 + W_2\bar{Y}_2$, where $W_1 = (N_1/N)$ and $W_2 = (N_2/N)$.
The sample mean $\bar{y}_1 = (\sum_1^{n_1} y_i)/n_1$ is unbiased for $\bar{Y}_1$, but has a bias equal to
$W_2(\bar{Y}_1 - \bar{Y}_2)$ in estimating $\bar{Y}$.
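
The size of this bias follows in one line from the decomposition of the population mean, using $W_1 + W_2 = 1$:

```latex
E(\bar{y}_1) - \bar{Y}
  = \bar{Y}_1 - (W_1\bar{Y}_1 + W_2\bar{Y}_2)
  = (1 - W_1)\bar{Y}_1 - W_2\bar{Y}_2
  = W_2(\bar{Y}_1 - \bar{Y}_2).
```

As a purely illustrative numerical case, with $W_2 = 0.3$, $\bar{Y}_1 = 50$ and $\bar{Y}_2 = 40$ the population mean is $47$ and the respondent mean overshoots it by $0.3 \times 10 = 3$.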


2.1 Subsampling the Nonrespondents 

Hansen and Hurwitz (1946) suggest drawing a subsample of size $m = n_2/k$, $k \geq 1$, from
the $n_2$ nonrespondents and assume that responses are available from all of them. The sample
mean $\bar{y}_{2m} = (\sum_1^m y_i)/m$ is unbiased for the mean $\bar{y}_2$ of the $n_2$ units. The estimator for
$\bar{Y}$ suggested by the above authors is


\[ \hat{\bar{Y}}_{HH} = w_1\bar{y}_1 + w_2\bar{y}_{2m}, \tag{2.1} \]

where $w_1 = (n_1/n)$ and $w_2 = (n_2/n)$.

For a given set of $n_1$ respondents and $n_2$ nonrespondents, this estimator is unbiased for
$\bar{y} = w_1\bar{y}_1 + w_2\bar{y}_2 = (\sum_1^n y_i)/n$. Thus, it is unbiased for $\bar{Y}$.
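
A minimal sketch of the estimator (2.1), assuming the respondent values and the subsampled nonrespondent values are held in two lists (the function name `hansen_hurwitz_mean` is hypothetical):

```python
from statistics import mean

def hansen_hurwitz_mean(y1, y2_sub, n2):
    """Weighted mean w1 * ybar_1 + w2 * ybar_2m.

    y1     : responses from the n1 initial respondents
    y2_sub : responses from the m subsampled nonrespondents
    n2     : number of initial nonrespondents (n2 >= len(y2_sub))
    """
    n1 = len(y1)
    n = n1 + n2
    return (n1 / n) * mean(y1) + (n2 / n) * mean(y2_sub)
```

Note that the subsample mean receives the full nonrespondent weight $n_2/n$, not $m/n$; this is what removes the bias of the respondent mean.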

The variance of this estimator is

\[ V(\hat{\bar{Y}}_{HH}) = \frac{(1 - f)}{n}S^2 + \frac{W_2(k - 1)}{n}S_2^2, \tag{2.2} \]

where $f = (n/N)$; see Cochran (1977, p. 371).
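
For planning purposes the variance (2.2) is easy to tabulate against the subsampling factor $k$. The sketch below assumes the form $(1-f)S^2/n + W_2(k-1)S_2^2/n$ given in Cochran (1977); the function name and the numerical inputs in the usage note are illustrative only.

```python
def hansen_hurwitz_variance(N, n, k, S2, W2, S2_2):
    """Variance of the Hansen-Hurwitz mean.

    N, n : population and initial sample sizes
    k    : inverse subsampling fraction (m = n2 / k, k >= 1)
    S2   : population variance of y
    W2   : population proportion in the nonresponse stratum
    S2_2 : variance of y within the nonresponse stratum
    """
    f = n / N
    return (1.0 - f) / n * S2 + W2 * (k - 1.0) / n * S2_2
```

With $N = 1000$, $n = 100$, $S^2 = 25$, $W_2 = 0.3$, $S_2^2 = 16$ and $k = 2$, the two terms are $0.225$ and $0.048$. Setting $k = 1$ (all nonrespondents followed up) leaves only the ordinary simple random sampling term.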
Let $s_1^2 = \sum_1^{n_1} (y_i - \bar{y}_1)^2/(n_1 - 1)$ and $s_{2m}^2 = \sum_1^m (y_i - \bar{y}_{2m})^2/(m - 1)$ denote the
variances of the $n_1$ responses and the $m$ subsampled units. An unbiased estimator of the
variance is


Gere = 


eqilingy sys? 4 (ny = sn | 
n n— 1 


i (1e=*/) Ez (HS PAV Ge ee 


n n— 1 


(N — 1)w2(k — 1) 53m 
N(n — 1) (2.3) 


This expression can also be obtained from the variance estimators for double sampling 
and stratification derived by Cochran (1977, p. 333) and Rao (1973); see also Rao (1983). 


Post-stratification and subsampling 


The $(n - n_1)$ nonrespondents may be classified into $(L - 1)$ strata of sizes $(n_2, n_3, ..., n_L)$
according to an auxiliary characteristic, or for convenience in sampling at the next phase.
Subsamples of size $m_h = (n_h/k_h)$, $k_h \geq 1$, provide the means $\bar{y}_{hm} = \sum_1^{m_h} y_{hi}/m_h$ and
variances $s_{hm}^2 = \sum_1^{m_h} (y_{hi} - \bar{y}_{hm})^2/(m_h - 1)$.

The unbiased estimator for $\bar{Y}$ now is

\[ \hat{\bar{Y}} = \sum_{h=1}^{L} w_h\bar{y}_{hm}, \tag{2.4} \]

where $w_h = (n_h/n)$ and $\bar{y}_{1m} = \bar{y}_1$.

The variance of the above estimator is

\[ V(\hat{\bar{Y}}) = \frac{(1 - f)}{n}S^2 + \sum_{h=2}^{L} \frac{W_h(k_h - 1)}{n}S_h^2, \tag{2.6} \]

where $S_h^2 = \sum_1^{N_h} (y_{hi} - \bar{Y}_h)^2/(N_h - 1)$. The estimator for the variance is


De es Ie aI impgt (lic Ament ine or) 


Y ee are = saa EO ce 
mf) n 5 (n — 1) si n Lu (n — 1) 
(N-1) < 
NO Lu Wa (Kn — 1) Shims (2.7) 


where $k_1 = 1$, $\bar{y}_{1m} = \bar{y}_1$, and $s_{1m}^2 = s_1^2$ as defined earlier.


Other types of post-stratification may be considered. For instance, the $n$ units, respondents
as well as the nonrespondents, may be post-stratified into $L$ strata according to an auxiliary
variable. The $h$-th stratum will now have $n_{h1}$ respondents ($\sum_1^L n_{h1} = n_1$) with mean $\bar{y}_{h1}$ and
$n_{h2}$ nonrespondents ($\sum_1^L n_{h2} = n_2$). A subsample of size $m_{h2} = (n_{h2}/k_h)$ from the $n_{h2}$ units
will provide the mean $\bar{y}_{h2m}$. An unbiased estimator for the mean $\bar{Y}_h$ of the $h$-th stratum now is

\[ \hat{\bar{Y}}_h = \frac{n_{h1}\bar{y}_{h1} + n_{h2}\bar{y}_{h2m}}{n_h}, \tag{2.8} \]

where $n_h = (n_{h1} + n_{h2})$, and the unbiased estimator for $\bar{Y}$ is

\[ \hat{\bar{Y}}_{ps} = \sum_{h=1}^{L} \frac{n_h}{n}\hat{\bar{Y}}_h = \sum_{h=1}^{L} \frac{n_{h1}\bar{y}_{h1} + n_{h2}\bar{y}_{h2m}}{n}. \tag{2.9} \]


The variance of this estimator and its estimate can be found as in the above case. 

The estimator in (2.4) is preferable if there is much difference among the means of the 
response and nonresponse strata. The estimator in (2.9) should be preferred if the means 
of the respondents and nonrespondents differ in each stratum, and if there is much difference 
among the means of the strata. 

Sarndal and Swensson (1985) consider unequal probabilities of selection at the first phase 
and subsampling the nonrespondents after post-stratification. 


3. RATIO ESTIMATORS 


Let $x_i$, $i = (1, 2, ..., N)$, denote an auxiliary characteristic with population mean
$\bar{X} = (\sum_1^N x_i)/N$. Let $\bar{X}_1$ and $\bar{X}_2$ denote the means of the response and nonresponse
groups. Let $\bar{x} = (\sum_1^n x_i)/n$ denote the mean of all the $n$ units. Let $\bar{x}_1 = (\sum_1^{n_1} x_i)/n_1$ and
$\bar{x}_2 = (\sum_1^{n_2} x_i)/n_2$ denote the means of the $n_1$ responding units and the $n_2$ nonresponding
units. Further, let $\bar{x}_{2m} = (\sum_1^m x_i)/m$ denote the mean of the $m = (n_2/k)$ subsampled units.
The population variances of $x$ and $y$ are denoted by $S_x^2$ and $S_y^2$ and the population
covariance by $S_{xy}$. The correlation coefficient is $\rho_{xy} = (S_{xy}/S_xS_y)$. The sample variances
are denoted by $s_x^2$ and $s_y^2$. As before, the subscripts 1 and 2 denote the response and
nonresponse groups.


3.1 The Conventional Estimator for the Mean


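The ratio estimator for $\bar{Y}$ is

\[ t_1 = \frac{\bar{y}^*}{\bar{x}^*}\bar{X} = r^*\bar{X}, \tag{3.1} \]

where $\bar{y}^*$ is the same as the estimator in (2.1), $\bar{x}^* = (w_1\bar{x}_1 + w_2\bar{x}_{2m})$, and $r^* = (\bar{y}^*/\bar{x}^*)$; see
Cochran (1977, p. 374). Now,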


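\[ t_1 - \bar{Y} = \frac{(\bar{y}^* - R\bar{x}^*)\bar{X}}{\bar{x}^*} \approx (\bar{y}^* - R\bar{x}^*)\left(1 - \frac{\bar{x}^* - \bar{X}}{\bar{X}}\right), \tag{3.2} \]

where $R = (\bar{Y}/\bar{X})$. The approximation in (3.2) is obtained by expanding $(1/\bar{x}^*)$ in a Taylor
series, and it is valid for large values of the sample sizes $n$ and $m$. From (3.2) the bias of $t_1$ is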

\[ B_1 = E(t_1 - \bar{Y}) \approx \frac{(1 - f)}{n\bar{X}}(RS_x^2 - S_{xy}) + \frac{W_2(k - 1)}{n\bar{X}}(RS_{x2}^2 - S_{xy2}). \tag{3.3} \]


The bias vanishes only if (a) the regression of y on x goes through the origin for both 
the response and nonresponse strata and (b) the slopes of both the regressions are equal to 
R. The first condition is needed for the ratio estimator to be the optimum estimator for Y. 
For the second condition to be satisfied, $R_2 = (\bar{Y}_2/\bar{X}_2)$ should not differ much from
$R_1 = (\bar{Y}_1/\bar{X}_1)$.
From (3.2), a large sample approximation to the Mean Square Error (MSE) of $t_1$ is

\[ M_1 = E(t_1 - \bar{Y})^2 \approx \frac{(1 - f)}{n}S_d^2 + \frac{W_2(k - 1)}{n}S_{d2}^2 \tag{3.4} \]

\[ = \frac{(1 - f)}{n}\sum_{h=1}^{2} \frac{(NW_h - 1)}{(N - 1)}S_{dh}^2 + \frac{W_2(k - 1)}{n}S_{d2}^2, \tag{3.4a} \]

where $S_d^2 = \sum_1^N (y_i - Rx_i)^2/(N - 1)$ and $S_{dh}^2 = \sum_1^{N_h} (y_{hi} - Rx_{hi})^2/(N_h - 1)$ for $h = 1, 2$.
The expression in (3.4) is briefly indicated by Cochran (1977).

An estimator for this MSE is obtained by replacing $S_{d1}^2$ in (3.4a) by
$s_{d1}^2 = \sum_1^{n_1} (y_i - r^*x_i)^2/(n_1 - 1)$, $S_{d2}^2$ by $s_{d2}^2 = \sum_1^m (y_i - r^*x_i)^2/(m - 1)$ and $W_h$ by $w_h$.
It is possible to suggest alternative estimators for the above MSE.
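
The point estimator $t_1$ of (3.1) can be sketched as follows. This is an illustration under the assumption that $x$ is observed only for the respondents and the subsampled nonrespondents; the function name `ratio_estimator_t1` and its argument names are hypothetical.

```python
from statistics import mean

def ratio_estimator_t1(y1, x1, y2m, x2m, n2, X_bar):
    """Conventional ratio estimator t1 = (ybar* / xbar*) * X_bar.

    y1, x1   : values for the n1 respondents
    y2m, x2m : values for the m subsampled nonrespondents
    n2       : number of initial nonrespondents
    X_bar    : known population mean of the auxiliary variable
    """
    n1 = len(y1)
    n = n1 + n2
    w1, w2 = n1 / n, n2 / n
    y_star = w1 * mean(y1) + w2 * mean(y2m)   # Hansen-Hurwitz mean of y
    x_star = w1 * mean(x1) + w2 * mean(x2m)   # same weighting applied to x
    return (y_star / x_star) * X_bar
```

Both the numerator and the denominator carry the subsampling noise, which is the source of the second term in the bias (3.3).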


3.2 An Alternative Estimator for the Mean 


In some situations, there may not be any nonresponse on the auxiliary characteristic. Family
size, years of education, years of employment, and the like are auxiliary variables of this
type.

The subsample provides the means $\bar{x}_{2m}$ and $\bar{y}_{2m}$. However, since $\bar{x} = (\sum_1^n x_i)/n$ is
available, for $\bar{Y}$ we may consider

\[ t_2 = \frac{\bar{y}^*}{\bar{x}}\bar{X} = \frac{w_1\bar{y}_1 + w_2\bar{y}_{2m}}{\bar{x}}\bar{X}. \tag{3.5} \]
Bs 


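Since the expectation of $\bar{y}^*$ conditional on the first sample is equal to $\bar{y}$, the bias in $t_2$
is the same as the one in $\hat{\bar{Y}}_R = (\bar{y}/\bar{x})\bar{X}$. We note that $\hat{\bar{Y}}_R$ is the ratio estimator for the case
of complete response. This result can also be derived from the expression

\[ t_2 - \bar{Y} = \frac{(\bar{y}^* - R\bar{x})\bar{X}}{\bar{x}} \approx (\bar{y}^* - R\bar{x})\left(1 - \frac{\bar{x} - \bar{X}}{\bar{X}}\right). \tag{3.6} \]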


Since the conditional mean of $\bar{y}^*$ is equal to $\bar{y}$, the bias of $t_2$ is

\[ B_2 = E(t_2 - \bar{Y}) \approx \frac{(1 - f)}{n\bar{X}}(RS_x^2 - S_{xy}). \tag{3.7} \]

If the regression of $y$ on $x$ for the entire population goes through the origin, the bias of
$t_2$ in (3.7) vanishes. If the regression for the second stratum also goes through the origin,
the bias of $t_1$ in (3.3) would be small only when $R_2 = (\bar{Y}_2/\bar{X}_2)$ is close to $R$.

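From (3.6), the MSE of $t_2$ is

\[ M_2 = E(t_2 - \bar{Y})^2 \approx \frac{(1 - f)}{n}S_d^2 + \frac{W_2(k - 1)}{n}S_{y2}^2 \]

\[ = \frac{(1 - f)}{n}\sum_{h=1}^{2} \frac{(NW_h - 1)}{(N - 1)}S_{dh}^2 + \frac{W_2(k - 1)}{n}S_{y2}^2. \tag{3.8} \]

Note that $S_d^2 = S_y^2 + R^2S_x^2 - 2RS_{xy}$. An estimator of this MSE is obtained by
replacing $S_{d1}^2$, $S_{d2}^2$, $S_{y2}^2$ and $W_h$ by $s_{d1}^2$, $s_{d2}^2$, $s_{y2}^2$ and $w_h$ respectively, where

\[ s_{d1}^2 = \sum_1^{n_1} (y_i - r^{**}x_i)^2/(n_1 - 1), \]

\[ s_{d2}^2 = \sum_1^{m} (y_i - r^{**}x_i)^2/(m - 1), \]

\[ s_{y2}^2 = \sum_1^{m} (y_i - \bar{y}_{2m})^2/(m - 1). \]

In these expressions, $r^{**} = (\bar{y}^*/\bar{x})$.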

Comparing the approximate expressions in (3.4) and (3.8), we find that when
$R_1 = (\bar{Y}_1/\bar{X}_1)$ does not differ much from $R_2 = (\bar{Y}_2/\bar{X}_2)$, $t_2$ will have smaller MSE than
$t_1$ provided the correlation $\rho_2$ in the nonresponse stratum is not too high. Secondly, if $R_1$
differs much from $R_2$, $t_2$ may have smaller MSE than $t_1$ even when $\rho_2$ is high. The following
Section contains further comparisons between these two estimators.
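
For comparison with $t_1$, the alternative estimator $t_2$ of (3.5) replaces the noisy denominator by the mean of $x$ over all $n$ sampled units. The sketch below assumes $x$ is available for every sampled unit; the function name `ratio_estimator_t2` and its arguments are hypothetical.

```python
from statistics import mean

def ratio_estimator_t2(y1, y2m, x_all, X_bar):
    """Alternative ratio estimator t2 = (ybar* / xbar) * X_bar.

    y1    : y-values for the n1 respondents
    y2m   : y-values for the m subsampled nonrespondents
    x_all : x-values for all n sampled units (respondents and nonrespondents)
    X_bar : known population mean of the auxiliary variable
    """
    n = len(x_all)
    n1 = len(y1)
    n2 = n - n1
    y_star = (n1 / n) * mean(y1) + (n2 / n) * mean(y2m)   # Hansen-Hurwitz mean of y
    return y_star / mean(x_all) * X_bar
```

Only the numerator depends on the subsample here, so the subsampling term of the MSE involves the variance of $y$ itself rather than that of the ratio residuals.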

3.3 Further Comparisons


In this Section, we compare $t_1$ and $t_2$ through the linear model. For the two groups, we
consider the models

\[ y_{1i} = \alpha_1 + \beta x_{1i} + e_{1i}, \quad i = (1, 2, ..., N_1) \tag{3.9a} \]

and

\[ y_{2i} = \alpha_2 + \beta x_{2i} + e_{2i}, \quad i = (1, 2, ..., N_2), \tag{3.9b} \]

with the following assumptions:

\[ E(e_{1i} | x_{1i}) = 0, \quad E(e_{1i}e_{1i'}) = 0, \quad V(e_{1i} | x_{1i}) = v_1x_{1i}^t, \]

\[ E(e_{2i} | x_{2i}) = 0, \quad E(e_{2i}e_{2i'}) = 0, \quad V(e_{2i} | x_{2i}) = v_2x_{2i}^t. \]

We note that $(i \neq i')$ and in practice $t$ may lie between zero and 2. Further $e_{1i}$ and $e_{2i}$
are assumed to be uncorrelated. Biases and MSE's of $t_1$ and $t_2$ are obtained in the Appendix
with the assumption that the response group of size $N_1$ and the nonresponse group of
size $N_2$ are samples from the super-populations represented by the above models.


Comparisons of the biases 


Let $I$ denote the observations from the first initial sample. Since $E[(1/\bar{x}^*) | I] \geq (1/\bar{x})$
and $E(1/\bar{x}) \geq (1/\bar{X})$, from (A.2) and (A.3) we find that both $t_1$ and $t_2$ overestimate $\bar{Y}$.
Further the bias $B_1$ of $t_1$ is larger than the bias $B_2$ of $t_2$. From (A.6) and (A.7),


awW,(k — 1) S% 
B, — B= Mas e+ GAG) 


This difference in the biases increases with the size of the nonresponse stratum and decreases 
with an increase in the size of the subsample. 


Comparison of the MSE’s 
From (A.9) and (A.20), the difference in the MSE's of $t_1$ and $t_2$ is


Mi Mi = (Ay Aa) — Cot (Died, ). (3.11) 
From (A.10), (A.21), and (A.22), 
(A — Ax) — Cy = [3V(ay) + afy — 6°X?] 5 — Sir. (3.12) 
We note that 
at V(wW,) + a3 V(wW2) + 2a;a,Cov(w;, W2) 
N—n 


— (NEDA (a we a)? W, W. (3; 3) 


V(ay) 

The difference in (3.12) becomes large as $\alpha_1$ and $\alpha_2$ differ much from each other. A sufficient
condition for the right side of (3.12) to be nonnegative is that $\bar{\alpha}_w > \beta\bar{X}$. Further
analysis of this result shows that the above difference becomes large if $C_x = (S_x/\bar{X})$
becomes larger than $C_y = (S_y/\bar{Y})$ as the correlation $\rho_{xy} = (S_{xy}/S_xS_y)$ increases.

From (A.12) and (A.24), 


D, — D, = E{[2(6 — 6*) + 3(6*2 — §2)]e*?} 
+ 2E[8* — 6 — 6*2 + 6?) Ba), (3.14) 


We note that (5* — 6) = (x* — x) /X = w, (Xam — X,)/X. Further, E(6* — 6) = 0. 
When ¢ = 0, from (3.14) and the results in (A.14) and (A.17), to 0(n~*), 


D, — D, = 3E|(8*? — 6?) e*?] — 26[ (6%? — 62) B at] 
3W,(k — 1)S2, of 2W,(k — 1)S%, - - 
ener Fr ee vy, + a Vink 
Te (Wy, 2V2) Nae (Wi v, 2V2) 
W,(k — 1) 
= {20 -f + 1](Wiy, + Mv.) + 3K - 1) Wyv>} = Sos 
(3.15) 
This expression clearly is nonnegative. 
When ¢ = 1, from (3.14), (A.15) and (A.16), to 0(n~') 
(WV, X1 + WrkV2X>)) 
Diginies 26 | (6 Siete s| 
n 
is 26 | (0 aS) (WV x) + | 
N 
(3.16) 
Noting that E[(5* — 5) x,|J] = 0, from (3.16), 
Di = 2 [ny rk Wwae, oe %) |v. + (2/N) E| win, (2a, — %)] v2 
= — (2/n)kE| w3V (Xon|D)]¥2 + (2/N) El w3V (Xom|L) | v2 
al ee 2) 
_ _ 2(Nk — n) W2(k — 1)S%, “ 3.17) 


Nn?X2 


Thus, when ? = 1, D, > D,. However, the difference in (3.17) becomes negligible when 
n is large. 

The above results suggest that when $t = 0$, $t_1$ has larger MSE than $t_2$ if $\bar{\alpha}_w$ is larger than
$\beta\bar{X}$. When $t = 1$, $t_1$ will have larger MSE than $t_2$ if $\bar{\alpha}_w$ is considerably larger than $\beta\bar{X}$.


4. SEPARATE RATIO ESTIMATORS


4.1 The First Estimator 


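If $(\bar{X}_1, \bar{X}_2)$ are known, the separate ratio estimator for $\bar{Y}$ that can be suggested is

\[ \bar{y}_s = w_1r_1\bar{X}_1 + w_2r_2\bar{X}_2, \tag{4.1} \]

where $r_1 = (\bar{y}_1/\bar{x}_1)$ and $r_2 = (\bar{y}_2/\bar{x}_2)$. However, $(\bar{X}_1, \bar{X}_2)$ can be estimated by $(\bar{x}_1, \bar{x}_2)$
and $(\bar{y}_{2m}/\bar{x}_{2m})$ is an estimator of $r_2$. With these estimates, an estimator for $\bar{Y}$ is

\[ t_3 = w_1\bar{y}_1 + w_2\frac{\bar{y}_{2m}}{\bar{x}_{2m}}\bar{x}_2. \tag{4.2} \]

This estimator can be used if $\bar{x}_2$ is available but $\bar{X}$ is not; however, it does not make
use of $\bar{x}_1$.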
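
A short sketch of $t_3$, read here as the respondent mean plus a ratio-type estimate of the nonrespondent mean. The function name `separate_ratio_t3` and its arguments are hypothetical, and the code assumes $x$ is known for all $n_2$ nonrespondents while $y$ is known only for the subsample.

```python
from statistics import mean

def separate_ratio_t3(y1, y2m, x2m, x2):
    """Separate ratio estimator t3 = w1*ybar_1 + w2*(ybar_2m/xbar_2m)*xbar_2.

    y1       : y-values for the n1 respondents
    y2m, x2m : y- and x-values for the m subsampled nonrespondents
    x2       : x-values for all n2 initial nonrespondents
    """
    n1, n2 = len(y1), len(x2)
    n = n1 + n2
    r2m = mean(y2m) / mean(x2m)              # ratio estimated from the subsample
    return (n1 / n) * mean(y1) + (n2 / n) * r2m * mean(x2)
```

The ratio adjustment uses the auxiliary variable only within the nonresponse group, which is why neither the respondent mean of $x$ nor the population mean of $x$ is needed.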
From (4.2),

\[ t_3 - \bar{Y} = (\bar{y} - \bar{Y}) + w_2(\bar{x}_2/\bar{x}_{2m})(\bar{y}_{2m} - r_2\bar{x}_{2m}). \tag{4.3} \]


If $m$ is large, from (4.3) the bias in $t_3$ is

\[ B_3 = E(t_3 - \bar{Y}) \approx \frac{(k - 1)}{n\bar{X}_2}(R_2S_{x2}^2 - S_{xy2}). \tag{4.4} \]


The MSE of $t_3$ is

\[ M_3 = E(t_3 - \bar{Y})^2 \approx \frac{(1 - f)}{n}S_y^2 + \frac{W_2(k - 1)}{n}S_{2d2}^2, \tag{4.5} \]

where $S_{2d2}^2 = \sum_1^{N_2} (y_i - R_2x_i)^2/(N_2 - 1)$.
An estimator for this MSE is obtained by replacing the first term on the right of (4.5) by
$v(\bar{y}) = (1 - f)s^2/n$, $S_{2d2}^2$ by $s_{2d2}^2 = \sum_1^m (y_i - r_{2m}x_i)^2/(m - 1)$, where $r_{2m} = (\bar{y}_{2m}/\bar{x}_{2m})$,
and $W_2$ by $w_2$.


4.2 The Second Estimator 


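An estimator that utilizes $\bar{X}$ and $\bar{x}$ is

\[ t_4 = t_3\left(\frac{\bar{X}}{\bar{x}}\right) = \left(w_1\bar{y}_1 + w_2\frac{\bar{y}_{2m}}{\bar{x}_{2m}}\bar{x}_2\right)\left(\frac{\bar{X}}{\bar{x}}\right). \tag{4.6} \]

It may be beneficial to consider this estimator since the conditional mean of $t_3$ for large $m$
is equal to $\bar{y}$, and hence the conditional expectation of $t_4$ becomes equal to $(\bar{y}/\bar{x})\bar{X}$.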
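
A sketch of $t_4$ under the same assumptions as for $t_3$, with $x$ additionally observed for the respondents and the population mean of $x$ known. The function name `separate_ratio_t4` and its arguments are hypothetical.

```python
from statistics import mean

def separate_ratio_t4(y1, x1, y2m, x2m, x2, X_bar):
    """Estimator t4 = t3 * (X_bar / xbar), with xbar the mean of x over all n units.

    y1, x1   : values for the n1 respondents
    y2m, x2m : values for the m subsampled nonrespondents
    x2       : x-values for all n2 initial nonrespondents
    X_bar    : known population mean of the auxiliary variable
    """
    n1, n2 = len(y1), len(x2)
    n = n1 + n2
    t3 = (n1 / n) * mean(y1) + (n2 / n) * (mean(y2m) / mean(x2m)) * mean(x2)
    x_bar = (sum(x1) + sum(x2)) / n          # mean of x over the whole initial sample
    return t3 * X_bar / x_bar
```

The second ratio step corrects for the initial sample happening to have a mean of $x$ above or below the population value.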
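From (4.6),

\[ t_4 - \bar{Y} = \left(\frac{\bar{y}}{\bar{x}}\bar{X} - \bar{Y}\right) + w_2\frac{\bar{x}_2}{\bar{x}_{2m}}(\bar{y}_{2m} - r_2\bar{x}_{2m})\left(\frac{\bar{X}}{\bar{x}}\right). \tag{4.7} \]

If $n$ and $m$ are large, the bias of $t_4$ is

\[ B_4 = E(t_4 - \bar{Y}) \approx \frac{(1 - f)}{n\bar{X}}(RS_x^2 - S_{xy}) + \frac{(k - 1)}{n\bar{X}_2}(R_2S_{x2}^2 - S_{xy2}). \tag{4.8} \]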
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The MSE of ¢, is 


CK ai laps 


Bs 1 — 
M, = E(t, - yi2l Ds, > Sop 
n 


ION Sch Wak = 1) 
n N- 1 n 


An estimator of M, is obtained by replacing S3,, S2,, St, and W, by S21, Sto, S*s4piand 
W» respectively, where 


Sra: (4.9) 


= 
_ 


W-=— rx;)7/ (ny ial): 
(y; — rX;)?/(m = 1p 


Vi = Pm X;))*/(m — 1). 


II 


ts 
II 
-M:-M:- 


2 
S¥2d2 


We note that $r^* = (\bar{y}^*/\bar{x}^*)$ as defined in Section 3.1.


Comparing (4.5) and (4.9), we find that $t_4$ will have smaller MSE than $t_3$ if the population
correlation between x and y is high.


Further investigation is needed to evaluate the merits of the above two separate estimators 
relative to the estimators in the previous Section. 


5. RATIO ESTIMATORS IN THE PRESENCE OF THE HARDCORE


It is becoming increasingly apparent that in spite of subsampling the nonrespondents and 
a number of call-backs, a significant proportion of the sampled units, the hard-core, do not 
respond to the items in the survey. 

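For this situation, we consider the population to be composed of three groups of
sizes $(N_1, N_2, N_3)$, $N = \sum_1^3 N_i$, with means $(\bar{Y}_1, \bar{Y}_2, \bar{Y}_3)$ and variances $(S_{y1}^2, S_{y2}^2, S_{y3}^2)$.
The means and variances for the auxiliary characteristic are $(\bar{X}_1, \bar{X}_2, \bar{X}_3)$ and
$(S_{x1}^2, S_{x2}^2, S_{x3}^2)$. The population means of these two items are $\bar{Y} = (W_1\bar{Y}_1 + W_2\bar{Y}_2 + W_3\bar{Y}_3)$
and $\bar{X} = (W_1\bar{X}_1 + W_2\bar{X}_2 + W_3\bar{X}_3)$, where $\sum_1^3 W_i = 1$. Let $R_1 = (\bar{Y}_1/\bar{X}_1)$, $R_2 = (\bar{Y}_2/\bar{X}_2)$
and $R_3 = (\bar{Y}_3/\bar{X}_3)$.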

In the initial sample of size n, only $n_1$ units respond and provide the means $(\bar{x}_1, \bar{y}_1)$. The
numbers of units $(n_2, n_3)$ in the last two groups are not known, but their sum
$(n_2 + n_3) = (n - n_1)$ is known. The means $(\bar{x}_2, \bar{x}_3)$ of the auxiliary characteristic may
be known, but $(\bar{y}_2, \bar{y}_3)$ for the item of interest are not observed.

We consider the situation where in the subsample of size $m = (n - n_1)/k$, only $m_2$ units
respond and provide the means $(\bar{x}_{2m}, \bar{y}_{2m})$. The remaining $m_3 = (m - m_2)$ units, the
"hard-core", do not respond. Note that $m_1$ is not defined.

In Rao and Jackson (1984), a number of estimators for $\bar{Y}$ for the above situation are
examined, without utilizing the auxiliary information. In this Section, we suggest the
following six estimators that utilize the additional information. We briefly present the conditions
for which these estimators may be the optimum ones. For the sake of space, we have not
presented the derivations for these estimators.
(I). The difference between $R_1$, $R_2$ and $R_3$ is negligible. The $m_3$ units of the third
     group, the hard-core, are a random subsample of the m respondents at the
     second phase. In this case,

$$\hat{\bar{Y}}_{H1} = \frac{n_1 \bar{y}_1 + (n - n_1) \bar{y}_{2m}}{n_1 \bar{x}_1 + (n - n_1) \bar{x}_{2m}} \bar{X}. \tag{5.1}$$


(II). Same conditions as in (I), but poor correlation in the second and third groups.
      For this case,

$$\hat{\bar{Y}}_{H2} = \frac{n_1 \bar{y}_1 + (n - n_1) \bar{y}_{2m}}{n \bar{x}} \bar{X}. \tag{5.2}$$


(III). $\bar{X}_3 = (N_1\bar{X}_1 + N_2\bar{X}_2)/(N_1 + N_2)$ and $\bar{Y}_3 = (N_1\bar{Y}_1 + N_2\bar{Y}_2)/(N_1 + N_2)$,
       and $(R_1, R_2, R_3)$ do not differ much from each other. Under these conditions,

$$\hat{\bar{Y}}_{H3} = \frac{n_1 \bar{y}_1 + k m_2 \bar{y}_{2m}}{n_1 \bar{x}_1 + k m_2 \bar{x}_{2m}} \bar{X}. \tag{5.3}$$

Note that, since $E(m_2/m) = n_2/(n - n_1)$, an unbiased estimator of $n_2$ is
$[(n - n_1)/m]\, m_2 = k m_2$.


(IV). Same conditions as in (III), but poor correlation in the second and third groups.
      For this case,

$$\hat{\bar{Y}}_{H4} = \frac{n_1 \bar{y}_1 + k m_2 \bar{y}_{2m}}{(n_1 + k m_2) \bar{x}} \bar{X}. \tag{5.4}$$


(V). The three ratios differ from one another. The 7 units of the third group are a 
random subsample from the 7, units of the second group. In this case, 


A ny (N= 1) Yon X 
SG = 9 t 0) CAPR es ae Ph = )- OF) 
n n ie x 


(VI). The three ratios differ from one another. The 7; units of the third group are a 


random subsample from the (n; + n2) units of the first two groups. Under 
these conditions, 


; ( ny oa km, Yam _ ) (=) 6.6) 
Po R= oe a —__——_——_—  —_*¥. — }. i 
a ny ot km “ ny + km, Oy x 


While we expect the above conditions to be satisfactory, further research is needed to 
evaluate the performances of the above six estimators. 
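To make the role of the estimated group size concrete, the sketch below computes k*m2 as the estimate of the number of second-group units and plugs it into the combined-ratio forms read here for cases (III) and (IV). All function and argument names are hypothetical, the formulas follow one reading of the printed expressions, and the numbers are toy values, not survey data.

```python
def estimated_n2(n, n1, m, m2):
    """Estimate of n2 as k * m2, with k = (n - n1) / m."""
    k = (n - n1) / m
    return k * m2


def hardcore_ratio_case3(n, n1, m, m2, ybar1, xbar1, ybar2m, xbar2m, X_bar):
    """Case (III) form: weighted y total over weighted x total, times X_bar."""
    km2 = estimated_n2(n, n1, m, m2)
    return (n1 * ybar1 + km2 * ybar2m) / (n1 * xbar1 + km2 * xbar2m) * X_bar


def hardcore_ratio_case4(n, n1, m, m2, ybar1, ybar2m, xbar, X_bar):
    """Case (IV) form: weighted y mean over the overall sample mean of x."""
    km2 = estimated_n2(n, n1, m, m2)
    return (n1 * ybar1 + km2 * ybar2m) / ((n1 + km2) * xbar) * X_bar


if __name__ == "__main__":
    # n = 100 sampled, 60 respond; subsample m = 20 of the rest, 15 respond.
    print(estimated_n2(100, 60, 20, 15))
    print(hardcore_ratio_case3(100, 60, 20, 15, 12.0, 6.0, 8.0, 4.0, 5.0))
    print(hardcore_ratio_case4(100, 60, 20, 15, 12.0, 8.0, 5.0, 5.0))
```

In the toy call, 15 of the 20 subsampled units respond, so three quarters of the 40 first-phase nonrespondents are attributed to the second group and the remainder to the hard-core.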
APPENDIX: BIASES AND MSE'S UNDER
THE SUPER POPULATION MODEL 


Let ay = Wa, =F Wa, Ay = Wi) + Wa, 


N ny m 
= sek _* ee A 
E= ye e)/N, & = Y) e;/n, bm = ve e/mandé@ = wie, + wye,,. 


Now 
Ya op + 6X + E: (A.1) 
of RK | 
ty —- Y= ap Cita OMe ={ =o A i ES (A.2) 
i x 
and 
ay gee r* os a = 
b~ P= Foy —ay + 6(S-1)k+oR-B (A.3) 
x x 
1. Biases 


Eero = (x = ANAK and 6 i(k = X) 7X. Taylor’s expansion about X gives 
=? 
Eeete yo to (A.4) 
With these expansions, from (A.2) and (A.3), to $O(n^{-1})$ the biases of $t_1$ and $t_2$ are


* 
V(X ) Caan oo ( k= De, 
B, = ce ay = | ee a + say oes Aw (A.6) 
B, = x ay = | ee Ss ay. (A.7)
2. Mean Square Error of $t_1$
From the expansion in (A.4), 
x \ 2 
(=) soul ae ye 1S | (A.8) 


From (A.2), the MSE of $t_1$ can be written as


Myre (ts tEY)* = A, Dy, (A.9) 
where
From (A.16), when ¢ = 0,
E(E@) = = omy + W vp). 
Similarly, when ¢ = 1, 
E(E£e’) = © Onn + WV>Xom). 
3. Mean Square Error of $t_2$
From the expansion in (A.5) 


ENS) 
e =1 — 25 + 362. 
From (A.10), the MSE of $t_2$ can be written as
M, = E(t, — Y \ eae, + C, + Ds». 


With the expansions in (A.5) and (A.19) 
We note that
i ~ ENE = (a — £)* — (25 — 367) e*? + 2(5 — 8?) Fer,


E|e (a — ay) (— *) x| = E|(%* — %) (ay — aw)] = 0. 
GUIDELINES FOR MANUSCRIPTS


Before having a manuscript typed for submission, please examine a recent issue (Vol. 10,
No. 2 and onward) of Survey Methodology as a guide and note particularly the following
points:

1. Layout

1.1 Manuscripts should be typed on white bond paper of standard size (8½ x 11 inch),
one side only, entirely double spaced with margins of at least 1½ inches on all sides.

1.2 The manuscripts should be divided into numbered sections with suitable verbal titles. 

1.3 The name and address of each author should be given as a footnote on the first page
of the manuscript. 

1.4 Acknowledgements should appear at the end of the text. 

1.5 Any appendix should be placed after the acknowledgements but before the list of 
references. 

2. Abstract 
The manuscript should begin with an abstract consisting of one paragraph followed 
by three to six key words. Avoid mathematical expressions in the abstract. 

3. Style

3.1 Avoid footnotes, abbreviations, and acronyms. 

3.2 Mathematical symbols will be italicized unless specified otherwise except for functional
symbols such as “exp(·)” and “log(·)”, etc.

3.3 Short formulae should be left in the text but everything in the text should fit in single 
spacing. Long and important equations should be separated from the text and numbered 
consecutively with arabic numerals on the right if they are to be referred to later. 

3.4 Write fractions in the text using a solidus. 

3.5 Distinguish between ambiguous characters, (e.g., w, ω; o, O, 0; l, 1).

3.6 Italics are used for emphasis. Indicate italics by underlining on the manuscript. 

4. Figures and Tables 

4.1 All figures and tables should be numbered consecutively with arabic numerals, with 
titles which are as nearly self explanatory as possible, at the bottom for figures and 
at the top for tables. 

4.2 They should be put on separate pages with an indication of their appropriate
placement in the text. (Normally they should appear near where they are first referred to).

5. References

5.1 References in the text should be cited with authors’ names and the date of publication. 
If part of a reference is cited, indicate after the reference, e.g., Cochran (1977, p. 164). 

5.2 The list of references at the end of the manuscript should be arranged alphabetically
and for the same author chronologically. Distinguish publications of the same author
in the same year by attaching a, b, c to the year of publication. Journal titles should
not be abbreviated. Follow the same format used in recent issues.